CN112016307A - Title generation method of text information, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112016307A
CN112016307A (application CN202010813799.1A)
Authority
CN
China
Prior art keywords
text
keywords
processed
word
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010813799.1A
Other languages
Chinese (zh)
Inventor
黄崇远
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd and Shenzhen Huantai Technology Co Ltd
Priority to CN202010813799.1A
Publication of CN112016307A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/258: Heading extraction; Automatic titling; Numbering
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a title generation method for text information, an electronic device, and a storage medium. The method includes: acquiring a text to be processed and a plurality of reference texts; determining the relevance between the text to be processed and each reference text according to the word vectors of the text to be processed and the plurality of reference texts; and determining, from the plurality of reference texts, a target reference text whose relevance satisfies a preset requirement, and taking the title of the target reference text as the title of the text to be processed. In this way, the accuracy and efficiency of title generation for the text to be processed are improved.

Description

Title generation method of text information, electronic equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a title generation method for text information, an electronic device, and a storage medium.
Background
With the rapid development of the internet, the amount of text information, such as native advertisements and information articles, delivered in applications (APPs) or on websites keeps increasing. Each piece of text information needs a title that links to it, so that a user can view the text information by clicking on the entry title.
At present, titles for text information are generated manually by users based on their knowledge and past experience. Because the amount of text information is large, generating titles manually reduces the delivery efficiency of the text information and incurs a high production cost.
Disclosure of Invention
A first aspect of the embodiments of the present application provides a title generation method for text information, the method including: acquiring a text to be processed and a plurality of reference texts; determining the relevance between the text to be processed and each reference text according to the word vectors of the text to be processed and the plurality of reference texts; and determining, from the plurality of reference texts, a target reference text whose relevance satisfies a preset requirement, and taking the title of the target reference text as the title of the text to be processed.
A second aspect of the embodiments of the present application provides an electronic device, which includes a processor and a memory connected to the processor, where the memory is used to store program data, and the processor is used to execute the program data to implement a title generation method for text information.
A third aspect of embodiments of the present application provides a computer-readable storage medium having program data stored therein, the program data, when executed by a processor, being configured to implement a title generation method for text information.
The beneficial effect of this application is: in contrast to the prior art, the method acquires a text to be processed and a plurality of reference texts; determines the relevance between the text to be processed and each reference text according to their word vectors; determines, from the plurality of reference texts, a target reference text whose relevance satisfies a preset requirement; and takes the title of the target reference text as the title of the text to be processed. This avoids generating the title of the text to be processed manually, which saves production cost and improves the delivery efficiency of the text to be processed. Moreover, because the title is generated according to the relevance between the text to be processed and each reference text, the accuracy of the title of the text to be processed can be improved.
Drawings
In order to illustrate the technical solutions in the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort. Wherein:
fig. 1 is a schematic flowchart of an embodiment of a title generation method for text information provided in the present application;
fig. 2 is a schematic flowchart of another embodiment of a title generation method for text information provided in the present application;
FIG. 3 is a flowchart illustrating an embodiment of step S22 of FIG. 2 provided herein;
FIG. 4 is a flowchart illustrating an embodiment of step S23 of FIG. 2 provided herein;
FIG. 5 is a schematic flow chart diagram illustrating another embodiment of step S22 in FIG. 2 provided herein;
fig. 6 is a flowchart illustrating a title generation method for text information according to another embodiment of the present application;
FIG. 7 is a block diagram of an embodiment of an electronic device provided herein;
FIG. 8 is a block diagram of an embodiment of a computer storage medium provided herein.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a title generation method for text information provided by the present application. In this embodiment, the main body of the text information title generating method may be a processor.
The method may specifically comprise the steps of:
step S11: and acquiring a text to be processed and a plurality of reference texts.
In this embodiment, the text to be processed is a text for which a title is to be generated, and a reference text is a text that provides a reference title for the text to be processed. Each reference text has at least one associated title. Optionally, the title of a reference text may be generated manually or automatically. The number of reference texts can be set according to the actual situation, for example 10, 20, or 50.
Specifically, the processor acquires the text to be processed and the plurality of reference texts so as to automatically select the title of the text to be processed from the titles corresponding to the plurality of reference texts.
In some embodiments, the text to be processed may be an advertisement text and the reference texts may be information articles. Information-flow advertising centered on title lists is currently an important mode in the advertising market. For example, in a news APP, a native advertisement is styled like an information article, and in particular its entry title needs to resemble the title of an information article, so that the user's browsing experience is not disrupted by an advertisement that appears abruptly; this, however, also makes the production cost of native advertisements much higher than that of ordinary advertisements. To address this, in this embodiment the processor can acquire the advertisement text for which a title is to be generated and a plurality of information articles that provide title references, and, based on them, automatically select a suitable title for the advertisement text from the titles of the plurality of information articles. This avoids generating the title of the advertisement text manually, saves the production cost of the native advertisement, and improves its production efficiency.
In other embodiments, both the text to be processed and the reference texts may be information articles. Understandably, manually generating titles for information articles is also time-consuming and costly, so the title of an information article can likewise be obtained from a plurality of information articles that provide title references; details are not repeated here. In addition, the text to be processed and the reference texts may also be Chinese compositions, papers, and the like.
Step S12: and determining the relevancy of the text to be processed and each reference text according to the word vectors of the text to be processed and the plurality of reference texts.
In some embodiments, the processor may perform word segmentation on the text to be processed and on each reference text to obtain their respective keywords, extract a subset of those keywords, and obtain a word vector for each extracted keyword, yielding a plurality of word vectors for the text to be processed and a plurality of word vectors for the reference text. The processor then computes the pairwise inner products between the word vectors of the text to be processed and those of the reference text, and takes the sum of these inner-product values as the relevance between the text to be processed and the reference text. Optionally, a plurality of keywords may be preset for the text to be processed and for each reference text, so that the corresponding word vectors can be obtained directly from the preset keywords without word segmentation.
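A minimal sketch of the inner-product relevance computation described above, assuming the word vectors have already been obtained; the 4-dimensional vectors below are hypothetical placeholders, not values from the application:

```python
# Relevance between a pending text and one reference text, computed as the
# sum of pairwise inner products over their keyword word vectors.

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def text_relevance(pending_vectors, reference_vectors):
    # Inner product of every (pending, reference) word-vector pair,
    # summed into a single relevance score.
    return sum(
        inner_product(u, v)
        for u in pending_vectors
        for v in reference_vectors
    )

# Hypothetical 4-dimensional word vectors, for illustration only.
pending = [[0.2, 0.1, 0.0, 0.4], [0.3, 0.0, 0.1, 0.2]]
reference = [[0.1, 0.2, 0.3, 0.1], [0.0, 0.4, 0.1, 0.3]]

score = text_relevance(pending, reference)
```

With these placeholder vectors the four pairwise inner products are 0.08, 0.16, 0.08, and 0.07, giving a relevance of 0.39.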
In other embodiments, a plurality of word vectors may be preset and associated with the text to be processed and the plurality of reference texts, so the processor may directly obtain these associated word vectors and then compute from them the relevance between the text to be processed and each reference text; details are not repeated here.
Step S13: and determining a target reference text with the correlation degree meeting preset requirements from the multiple reference texts, and taking the title of the target reference text as the title of the text to be processed.
In some embodiments, the determining, from the plurality of reference texts, a target reference text with a relevance satisfying a preset requirement may be: and determining the reference text with the highest correlation degree from the plurality of reference texts as a target reference text of the text to be processed. In other embodiments, any reference text with a correlation degree within a preset range may be determined from the multiple reference texts, and the determined reference text is used as a target reference text of the text to be processed.
For example, the relevance of the text a to be processed and the reference text B, C, D, E is 0.52, 0.31, 0.78, and 0.55, and the relevance is sorted in descending order, that is, 0.78, 0.55, 0.52, and 0.31, if the target reference text is determined according to the highest relevance, the reference text D meeting the preset requirement is selected from the four reference texts, if the target reference text is determined according to any reference text with the relevance within the preset range, and the preset range is set to be 0.5 to 0.9, the reference text B, D, E meeting the preset requirement is selected from the four reference texts, and further, the processor may randomly select one reference text from the reference texts B, D, E as the target reference text.
Different from the prior art, this embodiment acquires a text to be processed and a plurality of reference texts; determines the relevance between the text to be processed and each reference text according to their word vectors; determines, from the plurality of reference texts, a target reference text whose relevance satisfies a preset requirement; and takes the title of the target reference text as the title of the text to be processed. This avoids generating the title manually, saves production cost, and improves the delivery efficiency of the text to be processed. Furthermore, the relevance between materials is designed and computed from the perspective of the overall corpus of materials; compared with directly feeding the keywords of the text to be processed into a model for matching, the logic is more reasonable, the generated title is more accurate, the internal associations between words are taken into account, and the recall rate of matchable article materials is greatly improved.
Referring to fig. 2 to 4, fig. 2 is a flowchart illustrating another embodiment of a title generation method for text information provided by the present application, fig. 3 is a flowchart illustrating an embodiment of step S22 in fig. 2 provided by the present application, and fig. 4 is a flowchart illustrating an embodiment of step S23 in fig. 2 provided by the present application. In this embodiment, the main body of the text information title generating method may be a processor. The embodiment will be further described by taking the text to be processed as the advertisement text and the reference text as the information article.
The method may specifically comprise the steps of:
step S21: and acquiring a text to be processed and a plurality of reference texts.
In this embodiment, please refer to the description of step S11 in the above embodiment for the description of step S21, which is not repeated herein.
In this embodiment, steps S22, S23, and S24 further elaborate step S12 of the above embodiment.
Step S22: and performing word segmentation on the text to be processed and the plurality of reference texts to obtain a first keyword set corresponding to the text to be processed and a second keyword set corresponding to each reference text.
In some embodiments, step S22 may specifically include sub-steps S221 and S222.
Step S221: performing word segmentation on the text to be processed to obtain a plurality of first keywords corresponding to the text to be processed, and performing word segmentation on the plurality of reference texts to obtain a plurality of second keywords corresponding to each reference text.
Optionally, the processor may perform the word segmentation on the text to be processed and the plurality of reference texts using at least one of the HanLP, NLPIR, and jieba word segmentation tools; that is, the texts can be segmented using natural language processing (NLP) techniques. HanLP has a high recognition rate for entities, but its Python support is relatively weak. NLPIR, from the Chinese Academy of Sciences, has good segmentation accuracy, but its code is hard to port and has a high learning cost. jieba currently has the best Python support of the three, with good segmentation performance, easy installation, and convenient use. Understandably, this embodiment is not limited to the three segmentation tools above, and the user may select a suitable segmentation tool according to the characteristics of each tool and the actual situation.
In this embodiment, the advertisement text may include an advertisement copy and corresponding tagged keywords. During advertisement creation, operators tag the advertisement with keywords or keyword phrases that carry definite information, such as "sports" or "renting"; these tagged keywords help with further title matching. Understandably, the tagged keywords are already segmented, so they need not be segmented again when segmenting the advertisement text. Therefore, in this step, segmenting the text to be processed to obtain a plurality of first keywords may specifically be: performing word segmentation on the advertisement copy to obtain a plurality of first keywords corresponding to the advertisement copy.
In this embodiment, the reference text includes an article title and article content. When the reference text is an information article, these are the title and content of the information article. Optionally, word segmentation may be performed on at least one of the article title and the article content. In some embodiments, since the article content has a potential influence on the title of the text to be processed, segmenting the article content captures its gist and helps in judging the relationship between the reference text and the text to be processed; therefore, both the article title and the article content can be segmented, and their relationships with the text to be processed considered together.
For example, for an information article titled "New phenomenon in the property market: buying to rent out, renting to live in", word segmentation of the title yields a plurality of second keywords, namely "property market", "new phenomenon", "house buying", "renting out", "house renting", and "self-occupation". The article content of the information article and the advertisement copy can be segmented in the same way. Optionally, only some paragraphs of the article content and the advertisement copy may be selected for segmentation, such as the first paragraph, the last paragraph, or the central paragraph, where the central paragraph is a paragraph that best represents the subject of the text.
Step S222: and extracting a preset number of the plurality of first keywords to form a first keyword set, and extracting a preset number of the plurality of second keywords to form a second keyword set.
Understandably, not all of the keywords obtained through word segmentation need to be used to calculate the text relevance between the text to be processed and the reference text.
Specifically, the word segmentation yields a large number of first keywords for the text to be processed and second keywords for each reference text. If all first and second keywords were used to compute the text relevance, the computation would be heavy, consuming substantial underlying computing resources for a low input-output ratio. Therefore, a preset number of first keywords are extracted to form the first keyword set, and a preset number of second keywords are extracted to form the second keyword set, which reduces the number of keywords and thus the amount of computation, lowering the consumption of underlying computing resources and improving the input-output ratio.
Taking the advertisement text as an example, this step specifically includes: extracting a preset number of the first keywords, and combining the extracted first keywords with the tagged keywords to form the first keyword set.
In some embodiments, extracting a preset number of the first keywords specifically includes: calculating the importance of each first keyword relative to the advertisement copy, and extracting a preset number of first keywords in descending order of importance. The importance of a first keyword indicates the degree to which it represents the advertisement copy: the greater its importance, the better the first keyword represents the advertisement copy.
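The importance-ranked extraction step can be sketched as below, assuming an importance score (e.g. a TF-IDF weight) has already been computed for each first keyword; the keyword names and scores are hypothetical:

```python
def top_k_keywords(importance, k):
    # Sort keywords by importance in descending order and keep the top k.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical importance of first keywords relative to an advertisement copy.
scores = {"apartment": 0.31, "downtown": 0.18, "discount": 0.25, "view": 0.07}
selected = top_k_keywords(scores, 2)   # extract a preset number (here 2)
```

The same routine applies to the second keywords extracted from an article title or article content, only with a different score table and a different k.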
Correspondingly, taking the information article as an example, this step specifically includes: extracting, from the plurality of second keywords, a first number of second keywords corresponding to the article title and a second number of second keywords corresponding to the article content, to form the second keyword set.
In some embodiments, the extracting a first number of second keywords corresponding to the article title from the plurality of second keywords specifically includes: respectively calculating the importance degree of each second keyword relative to the article title; and extracting a first number of second keywords corresponding to the article titles from the plurality of second keywords according to the sequence of the importance degrees from large to small. The importance degree of the second keyword represents the degree of representing the article title, and the greater the importance degree of the second keyword, the more the second keyword can represent the article title.
In some embodiments, extracting a second number of second keywords corresponding to the article content specifically includes: calculating the importance of each second keyword relative to the article content, and extracting, in descending order of importance, a second number of second keywords corresponding to the article content. The importance of a second keyword here indicates the degree to which it represents the article content: the greater its importance, the better the second keyword represents the article content.
In some embodiments, the importance of each keyword relative to a text may be calculated using the TF-IDF algorithm or another similar algorithm. TF-IDF is a statistical method for evaluating how important a word is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus; in other words, the more often a word appears in an article and the less often it appears in all documents, the more representative of that article it is. Therefore, this embodiment can calculate keyword weights with the TF-IDF algorithm, determine the important keywords of an article or advertisement copy, and extract them as representatives of its core topic to participate in the relevance calculation between articles and advertisement copies, improving accuracy. The TF-IDF algorithm is described in detail below.
The TF-IDF (Term Frequency-Inverse Document Frequency) formula consists of two parts: TF (term frequency) and IDF (inverse document frequency).
TF refers to the number of times a given word appears in a document. A word tends to have a higher raw count in a long document than in a short one, regardless of its actual importance, so the count is usually normalized (typically by dividing it by the total number of words in the document) to prevent a bias toward long documents. The specific calculation formula is:
TF = (number of times the term appears in the document) / (total number of words in the document)
The main idea of IDF is: the fewer the documents that contain a given term, the larger its IDF, indicating that the term has good category-distinguishing ability. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term, and taking the logarithm of the quotient; 1 is added to the denominator to avoid the special case of a zero denominator. The specific calculation formula is:
IDF = log( (total number of documents) / (number of documents containing the term + 1) )
Finally, the two are combined by multiplication: TF-IDF = TF × IDF.
As a concrete example of the TF-IDF calculation, suppose an article is 1000 words long and the terms "China", "bee", and "breeding" each appear 20 times, so the term frequency (TF) of each of the three terms is 0.02. Then, over all articles, assume there are 250000 articles in total, of which 62300 contain "China", 484 contain "bee", and 973 contain "breeding". Their inverse document frequencies (IDF) and TF-IDF values are shown in the following table:
Table 2-1: Inverse document frequency (IDF) and TF-IDF

Term       Documents containing the term   IDF     TF-IDF
China      62300                           0.603   0.0121
Bee        484                             2.713   0.0543
Breeding   973                             2.410   0.0482
As can be seen, "bee" has the highest TF-IDF value, followed by "breeding", with "China" lowest, so among the three the term "bee" best reflects the topic of this article.
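The worked example can be reproduced with a few lines of code. Note that the table's IDF values match log10(N / n) without the +1 smoothing term shown in the formula above; the smoothing is exposed as an optional parameter here:

```python
import math

N = 250000           # total number of articles
tf = 20 / 1000       # each term appears 20 times in a 1000-word article

def idf(doc_count, smooth=0):
    # The table's values correspond to log10(N / n) with smooth=0;
    # pass smooth=1 to apply the smoothed variant from the formula above.
    return math.log10(N / (doc_count + smooth))

doc_counts = {"China": 62300, "bee": 484, "breeding": 973}
results = {t: (round(idf(n), 3), round(tf * idf(n), 4))
           for t, n in doc_counts.items()}
# "bee" -> IDF 2.713, TF-IDF 0.0543, the highest of the three terms.
```

Running this reproduces every cell of Table 2-1 above.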
Optionally, the ratio between first keywords extracted from the advertisement copy and tagged keywords, and the ratio between second keywords extracted from the article content and the article title, may be set according to the actual situation, which is not limited in this embodiment.
In some embodiments, since the tagged keywords are selected manually and are therefore more reliable than machine-selected keywords, the first keywords extracted from the advertisement copy may be combined with all of the tagged keywords to form the first keyword set. Of course, in other embodiments, the extracted first keywords may be combined with only some of the tagged keywords. For example, if the number of tagged keywords per advertisement is fixed at 3 and 5 keywords in total are to be selected for the advertisement, then 2 first keywords are extracted from the advertisement copy; alternatively, 2 of the 3 tagged keywords may be kept and 3 first keywords extracted from the advertisement copy, forming a first keyword set of 5 keywords.
In some embodiments, although word segmentation of the article title yields far fewer second keywords than segmentation of the article content, the keywords extracted from the article title are highly important and are relatively representative of the article's gist, so the extraction quota is balanced between the article title and the article content. For example, the top two keywords (Top2) of each article's title and the top three keywords (Top3) of its content together form a set of 5 keywords, i.e., the first number is 2 and the second number is 3.
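The Top2/Top3 quota split can be sketched as follows, assuming the title and content keywords have already been ranked by importance; the example keywords are hypothetical:

```python
def build_second_keyword_set(title_ranked, content_ranked,
                             n_title=2, n_content=3):
    # Top-n_title keywords from the article title plus Top-n_content
    # keywords from the article content form the second keyword set.
    return title_ranked[:n_title] + content_ranked[:n_content]

# Hypothetical importance-ranked keywords (descending order).
title_ranked = ["property market", "new phenomenon", "house buying"]
content_ranked = ["renting out", "mortgage", "down payment", "landlord"]
second_set = build_second_keyword_set(title_ranked, content_ranked)
```

With the default quotas this yields a 5-keyword second keyword set, matching the first number of 2 and second number of 3 above.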
When a selected marking keyword duplicates a first keyword extracted from the advertisement copy, the first keywords are ranked by their importance in the advertisement copy, and as many additional first keywords are extracted as there are duplicates; if duplicates remain, extraction continues until the extracted first keywords no longer duplicate the marking keywords, at which point extraction stops. Similarly, the second keywords extracted from the article title may overlap those extracted from the article content; when this occurs, additional second keywords may be extracted according to their importance ranking in the article title or article content, continuing until the extracted second keywords no longer duplicate the previous second keywords.
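The combine-and-deduplicate selection described above can be sketched as follows; the keyword lists and the total of 5 are illustrative assumptions:

```python
def build_first_keyword_set(marked, extracted_ranked, total=5):
    """Combine manually marked keywords with machine-extracted ones.

    `extracted_ranked` must already be sorted by importance in the
    advertisement copy (descending); duplicates of already-selected
    keywords are skipped and replaced by the next candidate, as the
    embodiment describes.
    """
    selected = list(marked)
    for kw in extracted_ranked:          # walk down the importance ranking
        if len(selected) >= total:
            break
        if kw not in selected:           # skip keywords already marked
            selected.append(kw)
    return selected

marked = ["honey", "apiary", "pollination"]          # 3 fixed marking keywords
extracted = ["bee", "honey", "breeding", "hive"]     # ranked by importance in the copy
print(build_first_keyword_set(marked, extracted))
# → ['honey', 'apiary', 'pollination', 'bee', 'breeding']
```

"honey" is skipped as a duplicate, so "bee" and "breeding" fill the remaining two slots.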
Step S23: word relevancy of the first keyword set and each second keyword set is determined.
In this embodiment, step S23 may specifically include substeps S231, S232, and S233.
Step S231: and determining a first word vector set corresponding to the first keyword set and a second word vector set corresponding to the second keyword set.
Since relevancy cannot be calculated directly between words, in this embodiment the keywords are first represented as vectors, after which mathematical operations, such as the inner product of vectors, can be used to characterize the relationship between two words.
In some embodiments, the keywords in the first keyword set and the second keyword set may be directly converted into corresponding word vectors through a word vector conversion algorithm such as word2vec, an open-source word vector generation model from Google. Word2vec and similar word vector conversion algorithms convert each word into a vector of fixed dimension, such as 8 or 32 dimensions.
In other embodiments, the word vector corresponding to a keyword may be obtained by table lookup in a word vector database that stores correspondences between words and vectors. Specifically, the processor may search the word vector database for the first word vector corresponding to each first keyword in the first keyword set to form a first word vector set, and for the second word vector corresponding to each second keyword in the second keyword set to form a second word vector set. If a keyword has no corresponding word vector in the word vector database, it may be discarded, replaced by another keyword extracted from the advertisement or article, or converted directly into a word vector using an algorithm such as word2vec. The table-lookup approach allows the word vectors corresponding to the keywords to be obtained quickly.
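A minimal sketch of the table lookup with fallback described above, assuming a tiny in-memory dictionary in place of a real word vector database; the `fallback` hook stands in for a word2vec-style conversion:

```python
def lookup_vectors(keywords, vector_db, fallback=None):
    """Look up each keyword's word vector; on a miss, either convert
    it with a fallback model (e.g. word2vec) or discard the keyword,
    as the embodiment allows."""
    vectors = []
    for kw in keywords:
        if kw in vector_db:
            vectors.append(vector_db[kw])   # fast path: table lookup
        elif fallback is not None:
            vectors.append(fallback(kw))    # e.g. word2vec conversion
        # else: keyword is discarded
    return vectors

vector_db = {"bee": [0.9, 0.1], "honey": [0.8, 0.3]}
print(lookup_vectors(["bee", "honey", "unknown"], vector_db))
# → [[0.9, 0.1], [0.8, 0.3]]  ("unknown" is discarded: no fallback given)
```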
It can be understood that step S231 may be preceded by: performing word segmentation on a historical text to be processed and a historical reference text to obtain a third keyword set corresponding to them; determining a third word vector set corresponding to the third keyword set; and forming the word vector database from the third keyword set and the third word vector set. The word vector database is shown in the following table:
TABLE 2-2 Word vector database (keyword-to-word-vector mapping; the table content is an image in the original)
In some embodiments, the historical text to be processed comprises historical advertisement copy, historical advertisement titles, and historical marking keywords, and the historical reference text comprises historical article titles and historical article content. It can be understood that the third keywords in the third keyword set do not repeat one another. In other embodiments, since a single word vector database would store a large number of word vectors, making searches complex, and since the keywords in advertisements differ considerably from those in articles, an advertisement word vector database and an article word vector database may be constructed separately from the historical advertisements and historical articles; the word vectors corresponding to the keywords are then searched in the corresponding database, which speeds up the search.
Optionally, each third keyword in the third keyword set may be converted into a vector of fixed dimension, such as 8 or 32 dimensions, by the word2vec algorithm to form the third word vector set. Other similar word vector conversion algorithms may also be used, which is not limited herein.
Step S232: and performing inner product calculation on each word vector in the first word vector set and each word vector in the second word vector set to obtain a plurality of inner product values.
Step S233: and calculating the sum of the plurality of inner product values as the word relevancy of the first keyword set and each second keyword set.
Specifically, performing an inner product calculation between each word vector in the first word vector set and each word vector in the second word vector set may be expressed as:

W_ab = A_a · B_b

wherein A_a is a word vector in the first word vector set, B_b is a word vector in the second word vector set, and W_ab is the inner product of the word vector A_a in the first word vector set and the word vector B_b in the second word vector set. The inner product (dot product) of the two word vectors can characterize the relationship between the first keyword corresponding to A_a and the second keyword corresponding to B_b.
Further, based on the relevance of the words, word relevance of the first set of keywords and each second set of keywords may be calculated. Specifically, the word relevancy of the first keyword set and each second keyword set can be determined by calculating the sum of a plurality of inner product values. The specific calculation formula is as follows:
Sim_score = Σ_{i=1}^{n} W_i

wherein n is the number of inner product values, i ∈ {1, 2, …, n}, W_i is the i-th inner product value, and Sim_score is the word relevancy of the first keyword set and each second keyword set.
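The word-relevancy calculation described above (inner products of every pair of word vectors, then their sum) can be sketched with plain lists; the two small vector sets are illustrative:

```python
def dot(u, v):
    """Inner product of two word vectors."""
    return sum(a * b for a, b in zip(u, v))

def word_relevancy(first_vectors, second_vectors):
    """Sim_score: sum of the inner products of every pair of word
    vectors drawn from the first and second word vector sets."""
    return sum(dot(u, v) for u in first_vectors for v in second_vectors)

first_set  = [[1.0, 0.0], [0.0, 1.0]]   # word vectors for the advertisement keywords
second_set = [[0.5, 0.5]]               # word vectors for one reference article
print(word_relevancy(first_set, second_set))  # → 1.0
```

Each pair contributes one inner product value W_i, and the sum over all pairs is the word relevancy between the two keyword sets.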
Step S24: and determining the text relevance of the text to be processed and each reference text according to the word relevance.
Specifically, the processor may determine the text relevance of the text to be processed and each reference text according to the word relevance of the first keyword set and each second keyword set. Since the keywords in the first keyword set and the second keyword set may represent the text to be processed and the reference text, respectively, the word relevancy of the first keyword set and each second keyword set may directly represent the text relevancy of the text to be processed and each reference text.
Step S25: and determining a target reference text with the correlation degree meeting preset requirements from the multiple reference texts, and taking the title of the target reference text as the title of the text to be processed.
In this embodiment, please refer to the description of step S13 in the above embodiment for the description of step S25, which is not repeated herein.
In this embodiment, the first keyword set is formed by combining the marking keywords with first keywords extracted from the advertisement copy, so that manual accuracy and systematic processing are both taken into account: expert experience (manual marking) and standardized processing (system extraction) are referenced together, which further improves the accuracy of the title while the title is generated automatically.
Secondly, this embodiment considers not only the article title but also the article content: the second keyword set is formed by extracting part of the second keywords from each, and the relevancy is calculated accordingly, which also improves the accuracy of the title.
Furthermore, the relevancy among keywords is calculated by word segmentation plus Word2vec word vectors, and the target reference text is then screened by the comprehensive relevancy of the keyword sets. Compared with the related art, which substitutes the keywords of the text to be processed into a deep learning model (such as a Seq2Seq model) to generate the title directly, the title generated by this embodiment is more accurate and more stable, the process occupies fewer underlying computing resources, and the input-output ratio is higher.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating another embodiment of step S22 in fig. 2 according to the present application.
This embodiment is a description of another embodiment of the keyword extraction in the above embodiment.
Specifically, based on the word vector database in the above embodiment, step S22 (performing word segmentation on the text to be processed and the plurality of reference texts to obtain a first keyword set corresponding to the text to be processed and a second keyword set corresponding to each reference text) may specifically include substeps S323, S324, S325, and S326 in this embodiment.
Step S323: performing word segmentation on the text to be processed to obtain a plurality of first keywords corresponding to the text to be processed, and performing word segmentation on the plurality of reference texts to obtain a plurality of second keywords corresponding to each reference text.
In this embodiment, please refer to the description of step S221 in the above embodiment for the description of step S323, which is not repeated herein.
Step S324: and searching a first word vector corresponding to each first keyword and a second word vector corresponding to each second keyword in a word vector database.
In this embodiment, please refer to the description of step S231 in the above embodiment for the description of step S324, which is not repeated herein.
Step S325: according to the TextRank algorithm, performing inner product calculation on every two first word vectors corresponding to a plurality of first keywords appearing in the sliding window to obtain the correlation degree of each first keyword, and performing inner product calculation on every two second word vectors corresponding to a plurality of second keywords appearing in the sliding window to obtain the correlation degree of each second keyword.
TextRank is a keyword extraction algorithm derived from PageRank. Its calculation principle is as follows: take N words from the beginning of the text as one calculation unit (for example, take 5 words in sequence) and calculate the relationship between every pair of words; after all pairs in the window are calculated, slide the window forward by one word and continue the calculation until the window reaches the end of the text.
In the original TextRank algorithm, as long as two words fall in the same sliding window, the weight between them is incremented by 1. In this embodiment, however, since each word has a corresponding word vector, the weight between two words is not simply incremented; instead, the correlation between the words is quantified as the inner product of their word vectors, i.e., the value obtained by multiplying the two word vectors is used as the weight between the words. This is more accurate and reasonable, takes the semantic distance between words into account, and thus improves the reliability and effectiveness of extracting text keywords.
Step S326: according to the relevancy of the first keywords, selecting a first keyword set corresponding to the text to be processed from the plurality of first keywords, and according to the relevancy of the second keywords, selecting a second keyword set corresponding to each reference text from the plurality of second keywords.
Specifically, the plurality of first keywords may be ranked in descending order of relevancy, and a predetermined number of top-ranked first keywords selected to form the first keyword set corresponding to the text to be processed. Similarly, the plurality of second keywords may be ranked in descending order of relevancy, and a predetermined number of top-ranked second keywords selected to form the second keyword set corresponding to each reference text.
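The modified TextRank weighting described above can be sketched as follows. This is a simplified single-pass version: the full algorithm would iterate a PageRank-style update over the inner-product edge weights, whereas this sketch ranks words directly by their accumulated weights; the words and vectors are illustrative:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_keywords(words, vectors, window=5, top_k=2):
    """Score each word by accumulating inner-product weights with
    every word sharing a sliding window with it, then keep the top_k.
    The inner product replaces the original TextRank's flat +1 weight."""
    scores = {w: 0.0 for w in words}
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        for i in range(len(span)):
            for j in range(i + 1, len(span)):
                w = dot(vectors[span[i]], vectors[span[j]])  # quantified relatedness
                scores[span[i]] += w                         # instead of +1
                scores[span[j]] += w
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

vectors = {"bee": [1.0, 0.0], "keeps": [0.2, 0.2], "honey": [0.9, 0.1]}
print(rank_keywords(["bee", "keeps", "honey"], vectors, window=3, top_k=2))
# the semantically related content words outrank the filler word
```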
In this embodiment, keywords are extracted from the text to be processed and the reference texts using the TextRank algorithm together with word vectors, which is more accurate and reasonable and takes the semantic distance between words into account, thereby improving the reliability and effectiveness of extracting text keywords.
Referring to fig. 6, fig. 6 is a flowchart illustrating a title generation method of text information according to another embodiment of the present application.
Step S41: and acquiring a text to be processed and a plurality of reference texts.
In this embodiment, please refer to the description of step S11 in the above embodiment for the description of step S41, which is not repeated herein.
Step S42: and determining the relevancy of the text to be processed and each reference text according to the word vectors of the text to be processed and the plurality of reference texts.
In this embodiment, please refer to the description of step S12 and/or steps S22, S23, and S24 in the above embodiment for the description of step S42, which is not repeated herein.
Step S43: and acquiring the heat of a plurality of reference texts.
Specifically, the popularity of the plurality of reference texts may be calculated based on at least one of the reading number, praise number, and comment number of each reference text.
In some embodiments, weights may also be assigned to the reading number, praise number, and comment number of the reference text, and the popularity of the plurality of reference texts calculated from these counts and their weights. The specific calculation formula is as follows:
Hot_score = x*R + y*C + z*L (4-1)

wherein R is the reading number, C is the comment number, L is the praise number, x is the weight of the reading number, y is the weight of the comment number, z is the weight of the praise number, x + y + z = 1, and Hot_score is the popularity of the reference text.
For example, an article has several important attributes, such as the reading number, praise number, and comment number. For the article title, its effect in attracting users is reflected most directly in the reading number, so the weight of the reading number may be set to 0.6, that of the comment number to 0.25, and that of the praise number to 0.15; the corresponding calculation formula is:
Hot_score = 0.6*R + 0.25*C + 0.15*L (4-2)
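Formula (4-2) is straightforward to express in code; the reading, comment, and praise counts below are made-up examples:

```python
def hot_score(reads, comments, likes, x=0.6, y=0.25, z=0.15):
    """Popularity of a reference article, per formula (4-1); the
    default weights follow the example in the text (x + y + z = 1)."""
    return x * reads + y * comments + z * likes

print(hot_score(reads=1000, comments=40, likes=200))  # → 640.0
```

The reading number dominates by design, since it most directly measures how well the title attracts users.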
Step S44: determining, from the plurality of reference texts, a target reference text whose relevancy and popularity meet preset requirements, and taking the title of the target reference text as the title of the text to be processed.
Through the above steps, the two indexes of relevancy and popularity are obtained, so the final target reference text is selected by considering both, and the title of the target reference text is used as the title of the text to be processed.
Specifically, the degree of correlation and the degree of heat may be assigned weights; calculating scores of a plurality of reference texts according to the relevance, the popularity and the weight thereof; and determining the reference text with the highest value in the plurality of reference texts as the target reference text. The weight values of the relevancy and the hotness may be assigned according to actual situations, and are not limited herein. The specific calculation formula is as follows:
Score = e*Sim_score + f*Hot_score (4-3)

wherein Score is the score of the reference text, Sim_score is the relevancy between the text to be processed and the reference text, e is the weight of the relevancy, Hot_score is the popularity of the reference text, and f is the weight of the popularity.
In some embodiments, the core requirement is to ensure relevancy to the scenario, so the weight of relevancy is set to 0.7, while popularity serves as an auxiliary selection index with a weight of 0.3; the corresponding calculation formula is:
Score = 0.7*Sim_score + 0.3*Hot_score (4-4)
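Formula (4-4), combined with selecting the highest-scoring article, can be sketched as follows; the candidate titles and values are illustrative, and Hot_score is assumed here to be normalized to the same scale as Sim_score:

```python
def final_score(sim, hot, e=0.7, f=0.3):
    """Weighted combination of relevancy and popularity, per formula (4-3)."""
    return e * sim + f * hot

# (title, Sim_score, normalized Hot_score): made-up candidates
candidates = [
    ("Beekeeping basics", 0.9, 0.4),
    ("Honey trends 2020", 0.6, 0.9),
]
best = max(candidates, key=lambda c: final_score(c[1], c[2]))
print(best[0])  # → Beekeeping basics
```

With relevancy weighted at 0.7, the more relevant title wins even though the other candidate is more popular.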
In other embodiments, regression fitting may be performed on big data via LR (logistic regression), and the fitted parameters used as the weights, which is more reasonable.
Finally, the title of the article with the highest Score can be selected as the title for the current advertisement, which comprehensively considers the originality and relevance of the article content as well as the popularity of the article title.
In this embodiment, when determining the final title of the text to be processed, both the relevancy between the text to be processed and the reference texts and the popularity of the article titles themselves are considered; the reference texts are then ranked by a weighted combination of relevancy and popularity to determine the target reference text, which improves the accuracy rate.
Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an embodiment of an electronic device provided in the present application. The electronic device 500 comprises a processor and a memory connected to the processor; the memory is used to store program data, and the processor is used to execute the program data to implement the steps of any of the above method embodiments. The electronic device is, for example, a mobile phone or a computer.
In particular, the processor 510 is configured to control itself and the memory 520 to implement the steps in any of the above embodiments of the title generation method of text information. Processor 510 may also be referred to as a CPU (Central Processing Unit). Processor 510 may be an integrated circuit chip having signal processing capabilities. The processor 510 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. Additionally, processor 510 may be implemented collectively by multiple integrated circuit chips.
Referring to fig. 8, fig. 8 is a block diagram illustrating an embodiment of a computer storage medium according to the present disclosure. The computer readable storage medium 600 stores program data 610, and the program data 610 is used for implementing the steps of any of the above method embodiments when being executed by a processor.
The computer-readable storage medium 600 may be a medium that can store a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; it may also be a server that stores the computer program and can either send the stored program to another device to run or run it itself.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (19)

1. A title generation method for text information, the method comprising:
acquiring a text to be processed and a plurality of reference texts;
determining the relevancy of the text to be processed and each reference text according to the word vectors of the text to be processed and a plurality of reference texts;
and determining a target reference text with the correlation degree meeting preset requirements from the plurality of reference texts, and taking the title of the target reference text as the title of the text to be processed.
2. The method of claim 1,
the determining the relevancy of the text to be processed and each reference text according to the word vectors of the text to be processed and a plurality of reference texts comprises:
performing word segmentation on the text to be processed and the plurality of reference texts to obtain a first keyword set corresponding to the text to be processed and a second keyword set corresponding to each reference text;
determining word relevancy of the first keyword set and each second keyword set;
and determining the text relevancy between the text to be processed and each reference text according to the word relevancy.
3. The method of claim 2,
the performing word segmentation processing on the text to be processed and the plurality of reference texts to obtain a first keyword set corresponding to the text to be processed and a second keyword set corresponding to each reference text includes:
performing word segmentation on the text to be processed to obtain a plurality of first keywords corresponding to the text to be processed, and performing word segmentation on the plurality of reference texts to obtain a plurality of second keywords corresponding to each reference text;
and extracting a preset number of the plurality of first keywords to form a first keyword set, and extracting a preset number of the plurality of second keywords to form a second keyword set.
4. The method of claim 3,
the text to be processed is an advertisement text, and the advertisement text comprises an advertisement file and corresponding marking keywords;
the word segmentation processing is performed on the text to be processed to obtain a plurality of first keywords corresponding to the text to be processed, and the word segmentation processing includes:
performing word segmentation processing on the advertisement case to obtain a plurality of first keywords corresponding to the advertisement case;
the extracting of the preset number of the plurality of first keywords to form a first keyword set comprises the following steps:
and extracting a preset number of the first keywords, and combining the extracted preset number of the first keywords with the marking keywords to form a first keyword set.
5. The method of claim 4,
the extracting of the preset number of the plurality of first keywords comprises the following steps:
respectively calculating the importance degree of each first keyword relative to the advertisement copy;
and extracting a preset number of first keywords from the plurality of first keywords according to the sequence of the importance degrees from large to small.
6. The method of claim 3,
the reference text comprises article titles and article contents;
the extracting of the preset number of the plurality of second keywords forms a second keyword set, and the extracting includes:
and extracting a first quantity of second keywords corresponding to the article titles from the plurality of second keywords, and extracting a second quantity of second keywords corresponding to the article contents from the plurality of second keywords to form a second keyword set.
7. The method of claim 6,
the extracting a first number of the second keywords corresponding to the article title from the plurality of second keywords comprises:
respectively calculating the importance degree of each second keyword relative to the article title;
and extracting a first number of second keywords corresponding to the article titles from the plurality of second keywords according to the sequence of the importance degrees from large to small.
The extracting a second number of the second keywords corresponding to the article content from the plurality of second keywords comprises:
respectively calculating the importance degree of each second keyword relative to the article content;
and extracting a second number of second keywords corresponding to the article content from the plurality of second keywords according to the sequence of the importance degrees from large to small.
8. The method of claim 3,
the word segmentation processing is performed on the text to be processed and the plurality of reference texts, and the word segmentation processing comprises the following steps:
and performing word segmentation on the text to be processed and the plurality of reference texts by utilizing at least one word segmentation tool of a HanLP word segmentation tool, an nlpir word segmentation tool and a jieba word segmentation tool.
9. The method of claim 2,
the determining the word relevancy of the first keyword set and each second keyword set comprises:
determining a first word vector set corresponding to the first keyword set and a second word vector set corresponding to the second keyword set;
performing inner product calculation on each word vector in the first word vector set and each word vector in the second word vector set to obtain a plurality of inner product values;
and calculating the sum of a plurality of inner product values as the word relevancy of the first keyword set and each second keyword set.
10. The method of claim 9,
the determining a first word vector set corresponding to the first keyword set and a second word vector set corresponding to each second keyword set includes:
respectively searching a first word vector corresponding to each first keyword in the first keyword set in a word vector database to form a first word vector set;
and respectively searching a second word vector corresponding to each second keyword in the second keyword set in the word vector database to form a second word vector set.
11. The method of claim 10, further comprising:
performing word segmentation on a historical text to be processed and a historical reference text to obtain a third keyword set corresponding to the historical text to be processed and the historical reference text;
determining a third word vector set corresponding to the third key word set;
and forming a word vector database according to the third key word set and the third word vector set.
12. The method of claim 11,
the determining a third word vector set corresponding to the third keyword set includes:
and respectively converting each third key word in the third key word set into a vector with fixed dimensionality by using a word2vec algorithm to form a third word vector set.
13. The method of claim 2,
the performing word segmentation processing on the text to be processed and the plurality of reference texts to obtain a first keyword set corresponding to the text to be processed and a second keyword set corresponding to each reference text includes:
performing word segmentation on the text to be processed to obtain a plurality of first keywords corresponding to the text to be processed, and performing word segmentation on the plurality of reference texts to obtain a plurality of second keywords corresponding to each reference text;
searching a first word vector corresponding to each first keyword and a second word vector corresponding to each second keyword in a word vector database;
according to a TextRank algorithm, carrying out inner product calculation on every two first word vectors corresponding to a plurality of first keywords appearing in a sliding window to obtain the correlation degree of each first keyword, and carrying out inner product calculation on every two second word vectors corresponding to a plurality of second keywords appearing in the sliding window to obtain the correlation degree of each second keyword;
and according to the correlation degree of the first keywords, a first keyword set corresponding to the text to be processed is selected from the plurality of first keywords, and according to the correlation degree of the second keywords, a second keyword set corresponding to each reference text is selected from the plurality of second keywords.
14. The method of claim 1, further comprising:
acquiring the heat of the plurality of reference texts;
the determining a target reference text with a correlation degree meeting a preset requirement from the plurality of reference texts and taking the title of the target reference text as the title of the text to be processed further comprises:
and determining a target reference text with the correlation degree and the heat degree meeting preset requirements from the plurality of reference texts, and taking the title of the target reference text as the title of the text to be processed.
15. The method of claim 14,
the acquiring the popularity of the plurality of reference texts comprises:
calculating the popularity of the plurality of reference texts according to at least one of the read count, the like count, and the comment count of each reference text.
16. The method of claim 15,
the calculating the popularity of the plurality of reference texts according to at least one of the read count, the like count, and the comment count of each reference text comprises:
assigning weights to the read count, the like count, and the comment count of the reference text; and
calculating the popularity of the plurality of reference texts according to their read counts, like counts, and comment counts and the corresponding weights.
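The weighted-sum popularity of claims 15 and 16 can be sketched as below. The weight values are illustrative assumptions, since the claims leave them unspecified.

```python
def popularity(read_count, like_count, comment_count,
               weights=(0.2, 0.3, 0.5)):
    # Weighted sum of engagement signals; the claim allows any subset
    # of the three counts, and the weight values here are hypothetical.
    w_read, w_like, w_comment = weights
    return w_read * read_count + w_like * like_count + w_comment * comment_count
```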
17. The method of claim 14,
the determining, from the plurality of reference texts, a target reference text whose relevance and popularity meet preset requirements comprises:
assigning weights to the relevance and the popularity;
calculating a score for each of the plurality of reference texts according to its relevance, its popularity, and the assigned weights; and
determining the reference text with the highest score among the plurality of reference texts as the target reference text.
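The selection in claim 17 reduces to an argmax over a weighted score. In this sketch the field names and the 0.7/0.3 weight split are illustrative assumptions.

```python
def pick_target_reference(reference_texts, w_rel=0.7, w_pop=0.3):
    # Each candidate carries a precomputed relevance and popularity;
    # the reference text with the highest weighted score is the target
    # whose title is reused for the text to be processed.
    def score(text):
        return w_rel * text["relevance"] + w_pop * text["popularity"]
    return max(reference_texts, key=score)
```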
18. An electronic device, comprising a processor and a memory coupled to the processor, wherein the memory is configured to store program data, and the processor is configured to execute the program data to implement the title generation method of text information according to any one of claims 1 to 17.
19. A computer-readable storage medium, wherein program data is stored in the computer-readable storage medium, and the program data, when executed by a processor, implements the title generation method of text information according to any one of claims 1 to 17.
CN202010813799.1A 2020-08-13 2020-08-13 Title generation method of text information, electronic equipment and storage medium Withdrawn CN112016307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010813799.1A CN112016307A (en) 2020-08-13 2020-08-13 Title generation method of text information, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112016307A true CN112016307A (en) 2020-12-01

Family

ID=73504323

Country Status (1)

Country Link
CN (1) CN112016307A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208990A1 (en) * 2006-02-23 2007-09-06 Samsung Electronics Co., Ltd. Method, medium, and system classifying music themes using music titles
CN104598439A (en) * 2013-10-30 2015-05-06 阿里巴巴集团控股有限公司 Title correction method and device of information object and method for pushing information object
CN106383817A (en) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method capable of utilizing distributed semantic information
CN106502985A (en) * 2016-10-20 2017-03-15 清华大学 A kind of neural network modeling approach and device for generating title
CN106933808A (en) * 2017-03-20 2017-07-07 百度在线网络技术(北京)有限公司 Article title generation method, device, equipment and medium based on artificial intelligence
CN108509417A (en) * 2018-03-20 2018-09-07 腾讯科技(深圳)有限公司 Title generation method and equipment, storage medium, server
CN110795930A (en) * 2019-10-24 2020-02-14 网娱互动科技(北京)股份有限公司 Article title optimization method, system, medium and equipment
CN110866391A (en) * 2019-11-15 2020-03-06 腾讯科技(深圳)有限公司 Title generation method, title generation device, computer readable storage medium and computer equipment
CN111401044A (en) * 2018-12-27 2020-07-10 北京字节跳动网络技术有限公司 Title generation method and device, terminal equipment and storage medium
CN111507097A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Title text processing method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220245322A1 (en) * 2021-01-29 2022-08-04 Salesforce.Com, Inc. Machine-learning based generation of text style variations for digital content items
US11694018B2 (en) * 2021-01-29 2023-07-04 Salesforce, Inc. Machine-learning based generation of text style variations for digital content items
CN113761192A (en) * 2021-05-18 2021-12-07 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN113761192B (en) * 2021-05-18 2024-05-28 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment

Similar Documents

Publication Publication Date Title
EP2368200B1 (en) Interactively ranking image search results using color layout relevance
US10296582B2 (en) Method and apparatus for determining morpheme importance analysis model
CN109960756B (en) News event information induction method
US20170228459A1 (en) Method and device for mobile searching based on artificial intelligence
WO2016000555A1 (en) Methods and systems for recommending social network-based content and news
US20090254540A1 (en) Method and apparatus for automated tag generation for digital content
WO2020233344A1 (en) Searching method and apparatus, and storage medium
CN106095949A (en) A kind of digital library's resource individuation recommendation method recommended based on mixing and system
JP6056610B2 (en) Text information processing apparatus, text information processing method, and text information processing program
US20230147941A1 (en) Method, apparatus and device used to search for content
US10810245B2 (en) Hybrid method of building topic ontologies for publisher and marketer content and ad recommendations
CN109952571B (en) Context-based image search results
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
CN112016307A (en) Title generation method of text information, electronic equipment and storage medium
Capelle et al. Bing-SF-IDF+ a hybrid semantics-driven news recommender
CN111274366A (en) Search recommendation method and device, equipment and storage medium
CN113705217B (en) Literature recommendation method and device for knowledge learning in electric power field
Ma et al. A multiple relevance feedback strategy with positive and negative models
CN116186198A (en) Information retrieval method, information retrieval device, computer equipment and storage medium
JP6260678B2 (en) Information processing apparatus, information processing method, and information processing program
CN110806861B (en) API recommendation method and terminal combining user feedback information
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium
Chi et al. Concepts recommendation for searching scientific papers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201201