CN112131463A - Hot spot extraction method, storage medium and server - Google Patents

Hot spot extraction method, storage medium and server Download PDF

Info

Publication number
CN112131463A
CN112131463A (application CN202010950134.5A)
Authority
CN
China
Prior art keywords
text
texts
categories
category
roberta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010950134.5A
Other languages
Chinese (zh)
Inventor
江永渡
邵陈杰
赵志武
程德生
厉屹
林镇杰
钱刚
朱文
章冬红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Soft Hangzhou Anren Network Communication Co ltd
Original Assignee
China Soft Hangzhou Anren Network Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Soft Hangzhou Anren Network Communication Co ltd filed Critical China Soft Hangzhou Anren Network Communication Co ltd
Priority to CN202010950134.5A priority Critical patent/CN112131463A/en
Publication of CN112131463A publication Critical patent/CN112131463A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a hot spot extraction method, a storage medium and a server. The hot spot extraction method comprises the following steps: obtaining corpus data of hot spots to be extracted; pre-training a roberta model in the general field on the corpus data to obtain a roberta model in the professional field; extracting a feature vector of each of a plurality of texts with the professional-field roberta model; constructing training samples for a twin network from the feature vector of each of the plurality of texts; adjusting the parameters of the professional-field roberta model by means of the twin network according to the training samples to obtain a target roberta model; extracting a feature vector of each text with the target roberta model; clustering the plurality of texts of the corpus data with a clustering algorithm to obtain a plurality of categories; and extracting hot spots of the corpus data according to the target roberta model and the plurality of categories. The method better captures information unique to the corpus data, reduces the occurrence of unk (unknown) tokens for words in the corpus data, and improves the accuracy of clustering and hot spot extraction.

Description

Hot spot extraction method, storage medium and server
Technical Field
The invention relates to the technical field of natural language processing, in particular to a hot spot extraction method, a storage medium and a server.
Background
With the development of network information technology, text data is growing rapidly. Automatically processing this information by computer to mine hot topics in a timely and accurate manner is of great significance for understanding the latest public-opinion hotspots, studying sudden (breaking) hotspots, and grasping the trend of future hot topics. Current methods for automatically extracting hot spots generally apply a clustering algorithm to obtain categories and then extract keywords from each category as the final hot spot result. Limited by the clustering effect, the returned top-n hot spots often include categories with the same meaning, so the hot spot extraction effect is not ideal.
Disclosure of Invention
The invention provides a hot spot extraction method, a storage medium and a server, which are used for improving the accuracy of hot spot extraction.
In a first aspect, the present invention provides a hot spot extraction method, including:
obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field;
extracting a feature vector of each text in a plurality of texts according to a roberta model in the professional field;
constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
adjusting parameters of a roberta model in the professional field in a twin network mode according to the training sample to obtain a roberta model of the target;
extracting a feature vector of each text in the plurality of texts according to the roberta model of the target;
clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
and extracting hot spots of the corpus data according to the roberta model of the target and a plurality of categories.
In this scheme, after the general-field roberta model is pre-trained on the corpus data of the hot spots to be extracted, the model can better capture information unique to that corpus data. Pre-training also reduces the occurrence of unk (unknown) tokens for words in the corpus data, laying a foundation for subsequently extracting the hot spots of the corpus data more accurately. In addition, adjusting the parameters of the roberta model by means of a twin network (i.e., a Siamese network) directly optimizes the similarity distance, drawing similar texts closer together and pushing dissimilar texts further apart, thereby improving the accuracy of the subsequent clustering and hence of the extracted hot spots.
In a specific embodiment, according to the roberta model in the professional field, extracting a feature vector of each text in a plurality of texts specifically comprises: taking the last set number layers of the roberta model in the professional field, and averaging the feature vectors of each character of each layer in the set number layers; and adding the feature vectors of the set number layers, and taking the average value of the feature vectors to obtain the feature vector of each text. By fusing the feature vectors of the last layers, the feature representation of the obtained text can be improved, the feature vectors of the text have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved.
In a specific embodiment, constructing the training samples of the twin network according to the feature vector of each of the plurality of texts comprises: pairing each text with every other text to form a plurality of text pairs, each pair comprising two texts; calculating the similarity between the feature vectors of the two texts in each pair; sorting the text pairs by similarity from high to low; selecting the pairs whose similarity falls within the top first proportion as positive samples of the training samples; and selecting the pairs whose similarity falls within the bottom second proportion as negative samples of the training samples.
In a specific embodiment, calculating the similarity between the feature vectors of the two texts in each text pair specifically includes: and calculating the cosine similarity between the feature vectors of the two texts in each text pair.
In a specific embodiment, adjusting the parameters of the roberta model in the professional field by means of the twin network according to the training samples to obtain the target roberta model comprises: constructing two parameter-sharing roberta models from the roberta model in the professional field; inputting the two texts of each text pair among the positive and negative samples into the two parameter-sharing roberta models respectively, and outputting the feature vectors of the two texts; calculating the cosine similarity between the two feature vectors; and supervising and optimizing the parameters of the parameter-sharing roberta models through a contrastive loss function to obtain the target roberta model. This adjusts the parameters of the roberta model in the professional field.
In one specific embodiment, the contrastive loss function is:
loss = y × d² + (1 − y) × max(margin − d, 0)²
wherein y is the label value, 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts of the positive or negative sample pair;
margin represents the set cosine-distance interval. This loss is used to fine-tune the parameters of the roberta model in the professional field.
In a specific embodiment, the clustering algorithm is a density-based DBSCAN clustering algorithm. The number of categories does not need to be determined, the categories can be combined in the subsequent processing operation, and the clustering effect is improved.
In a specific embodiment, clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories specifically includes:
step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
step 2: deleting outliers to obtain clustering results with a plurality of different categories;
step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category of the clustering result;
step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result; if the cosine similarity between the centroids of two categories is larger than a set threshold, merging the two categories;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is no longer greater than the set threshold, and outputting the resulting plurality of categories. This reduces the similarity between different categories and improves the clustering effect.
In a specific embodiment, extracting the hot spots of the corpus data according to the target roberta model and the plurality of categories is specifically: extracting, by a method combining a semantic score and a keyword score, the text with the highest combined semantic and keyword score in each of the plurality of categories as a hot spot of the corpus data. Displaying a representative text instead of keywords expresses the meaning of each category more intuitively.
In a specific embodiment, according to a method of combining a semantic score and a keyword score, extracting a text with the highest semantic score and the highest keyword score in each of a plurality of categories as a hotspot of corpus data specifically includes:
averaging the feature vectors of the texts contained in each category to obtain the centroid of each category;
calculating the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text in the category;
performing word segmentation on texts contained in each category;
extracting keywords of each category by a tf-idf method;
sorting according to importance, and selecting keywords with the top n importance;
obtaining keyword quantity characteristics according to the quantity of keywords contained in each text in the category;
dividing the number characteristics of the keywords by n to obtain the keyword score of each text in the category;
selecting a text with the highest sum score of the semantic score and the keyword score in each category as a template of the category;
sequencing the templates of the multiple categories according to the number of the text pieces contained in each category;
and selecting the template corresponding to the first h categories as the hot spot of the corpus data.
Selecting as hot-spot templates the templates of the categories whose text counts rank among the first few improves the accuracy of template extraction.
In a second aspect, the present invention also provides a storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute any one of the hot spot extraction methods described above.
In a third aspect, the present invention further provides a server, where the server includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute any one of the hot spot extraction methods by calling the computer program stored in the memory.
Drawings
Fig. 1 is a flowchart of a hot spot extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of adjusting parameters of a roberta model in a professional field by a twin network according to a training sample according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For convenience of understanding the hot spot extraction method provided by the embodiment of the present invention, an application scenario of the hot spot extraction method provided by the embodiment of the present invention is first described below, where the hot spot extraction method is used to extract a hot spot in corpus data. The hot spot extraction method is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, the hot spot extraction method provided by the embodiment of the present invention includes:
s10: obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
s20: pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field;
s30: extracting a feature vector of each text in a plurality of texts according to a roberta model in the professional field;
s40: constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
s50: adjusting parameters of a roberta model in the professional field in a twin network mode according to the training sample to obtain a roberta model of the target;
s60: extracting a feature vector of each text in the plurality of texts according to the roberta model of the target;
s70: clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
s80: and extracting hot spots of the corpus data according to the roberta model of the target and a plurality of categories.
In this scheme, after the general-field roberta model is pre-trained on the corpus data of the hot spots to be extracted, the model can better capture information unique to that corpus data. Pre-training also reduces the occurrence of unk (unknown) tokens for words in the corpus data, laying a foundation for subsequently extracting the hot spots of the corpus data more accurately. In addition, adjusting the parameters of the roberta model by means of the twin network directly optimizes the similarity distance, drawing similar texts closer together and pushing dissimilar texts further apart, thereby improving the accuracy of the subsequent clustering and hence of the extracted hot spots. Each of the above steps is described in detail below with reference to the accompanying drawings.
Firstly, corpus data of a hot spot to be extracted is obtained, wherein the corpus data comprises a plurality of texts. The corpus data may be a piece of sports news, financial news, military news, social news, entertainment news, historical news, and the like.
And then, pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field. The roberta model in the general field can be a general roberta model trained on databases such as Chinese Wikipedia, Baidu encyclopedia, Sina and microblog. The method for pre-training the roberta model in the general field to obtain the roberta model in the professional field by specifically adopting the corpus data of the hot spot to be extracted is a pre-training method in the prior art. The general roberta model can perform character-level cutting according to the vocab.txt file, if words which are not contained in the vocab.txt exist in the corpus data of the hot spot to be extracted, the words are added into the vocab.txt, and the embedding layer of the general roberta model is expanded according to the length of the vocab.txt.
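As a minimal, hypothetical illustration of the vocabulary-extension step just described (the function name and data are invented; a real implementation would also expand the model's embedding layer to the new vocabulary length):

```python
def extend_vocab(vocab, corpus_texts):
    """Append characters that occur in the corpus but are missing from the
    character-level vocabulary, so they no longer map to the unk token."""
    known = set(vocab)
    new_chars = []
    for text in corpus_texts:
        for ch in text:
            if ch not in known:
                known.add(ch)
                new_chars.append(ch)
    # The embedding layer would then be resized to len(vocab) + len(new_chars).
    return vocab + new_chars

vocab = ["[PAD]", "[UNK]", "我", "们"]
extended = extend_vocab(vocab, ["我们赢了"])
```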
Next, a feature vector of each of the plurality of texts is extracted according to a roberta model of the professional field. During specific extraction, the last set number layers of the roberta model in the professional field can be taken firstly, and the feature vector of each character of each layer in the set number layers is averaged; and then adding the feature vectors of the set number layers, and taking the average value of the feature vectors to obtain the feature vector of each text. By fusing the feature vectors of the last layers, the feature representation of the obtained text can be improved, the feature vectors of the text have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved. Specifically, when the number of the last set number layers is determined, the last set number layers may be the last 2 layers, the last 3 layers, the last 4 layers, the last 5 layers, and the like. The following formula can be used for calculation:
V = (1/k) × Σ_{i=1..k} (1/m) × Σ_{j=1..m} w_{ij}
wherein V represents the output feature vector of the text;
i indexes a layer of the roberta model in the professional field, ranging from 1 to k, the set number of last layers (here k = 4);
j indexes a character input into the roberta model, ranging from 1 to m, where m is the length of the text;
w_{ij} represents the feature vector of the j-th character at the i-th layer.
The roberta model is a 12-layer transformer model, and the extracted feature vector of each word contains semantic information. The feature vectors of the sentences (namely each text) are obtained in an averaging mode, the semantic information of the sentences can be obtained, the features of the last layers are slightly different, the obtained feature representation can be improved by fusing the features of the last layers, the obtained feature representation can have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved.
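The character- and layer-averaging described above can be sketched as follows, assuming the per-layer hidden states for one text are already available as a numpy array of shape (num_layers, seq_len, dim); the function name and array layout are illustrative assumptions, not the patent's code:

```python
import numpy as np

def text_vector(hidden_states, k=4):
    """Average the per-character vectors within each of the last k layers,
    then average those k layer vectors to get one feature vector per text."""
    last = hidden_states[-k:]          # (k, seq_len, dim) -- last k layers
    per_layer = last.mean(axis=1)      # average over the m characters
    return per_layer.mean(axis=0)      # average over the k layers

# A 12-layer model, a 5-character text, 8-dimensional features:
h = np.ones((12, 5, 8))
v = text_vector(h)
```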
Next, the training samples of the twin network are constructed according to the feature vector of each of the plurality of texts. Specifically, each text is paired with every other text to form a plurality of text pairs, each pair comprising two texts; the similarity between the feature vectors of the two texts in each pair is calculated; the text pairs are sorted by similarity from high to low; the pairs whose similarity falls within the top first proportion are selected as positive samples, and the pairs whose similarity falls within the bottom second proportion are selected as negative samples. The first proportion can be 5%, 10%, 15%, 20%, etc., and likewise the second proportion; the two proportions may be equal or unequal. Constructing positive and negative samples in this way improves the accuracy of the subsequent fine-tuning of the roberta model's parameters.
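An illustrative numpy-only sketch of this pair-construction step (names and ratio defaults are invented):

```python
import numpy as np

def build_pairs(vectors, pos_ratio=0.1, neg_ratio=0.1):
    """Form all text pairs, rank them by cosine similarity, and take the
    top pos_ratio of pairs as positive samples and the bottom neg_ratio
    as negative samples for the twin network."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            pairs.append((i, j, float(unit[i] @ unit[j])))
    pairs.sort(key=lambda p: p[2], reverse=True)   # high to low similarity
    k_pos = max(1, int(len(pairs) * pos_ratio))
    k_neg = max(1, int(len(pairs) * neg_ratio))
    return pairs[:k_pos], pairs[-k_neg:]

vecs = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
positives, negatives = build_pairs(vecs)
```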
In addition, when the similarity between the feature vectors of the two texts in each text pair is calculated, the cosine similarity between the feature vectors of the two texts in each text pair can be calculated, and the similarity expressed by adopting the Euclidean distance between the feature vectors of the two texts can also be calculated; the similarity between the feature vectors of two texts, which is expressed by using the manhattan distance, can also be calculated.
Next, the parameters of the roberta model in the professional field are adjusted by means of the twin network according to the training samples to obtain the target roberta model. Referring to fig. 2, to adjust the parameters, two parameter-sharing roberta models may be constructed from the roberta model in the professional field; the two texts of each text pair among the positive and negative samples are then input into the two parameter-sharing roberta models respectively, and the feature vectors of the two texts are output; the cosine similarity between the two feature vectors is calculated; and the parameters of the parameter-sharing models are supervised and optimized through a contrastive loss function to obtain the target roberta model. In this way, the parameters of the roberta model in the professional field can be adjusted conveniently.
Specifically, the contrastive loss function may be:
loss = y × d² + (1 − y) × max(margin − d, 0)²
wherein y is the label value, 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts of the positive or negative sample pair;
margin denotes the set cosine-distance interval and may be set to 0.75, 0.80, 0.85, 0.90, or the like, so as to fine-tune the parameters of the roberta model in the professional field. The learning rate may be set to 5e-6 for this fine-tuning.
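A direct numpy rendering of this loss (the standard contrastive loss, matching the variables y, d and margin defined above; illustrative only):

```python
import numpy as np

def contrastive_loss(y, d, margin=0.8):
    """y: 1 for a positive pair, 0 for a negative pair; d: cosine distance.
    Positive pairs are pulled together (loss = d^2); negative pairs are
    pushed apart until their distance exceeds margin."""
    return y * d ** 2 + (1 - y) * np.maximum(margin - d, 0.0) ** 2
```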
Next, feature vectors of each of the plurality of texts are extracted according to the roberta model of the target. Because the roberta model of the target adjusts the parameters of the roberta model in a twin network mode, the similarity distance can be directly optimized, similar texts are more compact in distance, and dissimilar texts are more dispersed in distance, so that the accuracy of subsequent clustering is improved, and the accuracy of hot spots in subsequent extraction of corpus data is improved.
Next, the plurality of texts of the corpus data are clustered by using a clustering algorithm to obtain a plurality of categories. The clustering algorithm may be a density-based DBSCAN clustering algorithm. DBSCAN does not require the number of categories to be determined in advance, so categories can be merged in the subsequent processing operations, improving the clustering effect.
Specifically, clustering the plurality of texts of the corpus data by using the clustering algorithm to obtain the plurality of categories can proceed as follows:
step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
step 2: deleting outliers to obtain clustering results with a plurality of different categories;
step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category of the clustering result;
step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result; if the cosine similarity between the centroids of two categories is larger than a set threshold, merging the two categories;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is no longer greater than the set threshold, and outputting the resulting plurality of categories. This reduces the similarity between different categories and improves the clustering effect.
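Steps 3-4 above (run after DBSCAN has produced the initial clusters and outliers have been removed) can be sketched as the following merging loop — an illustrative numpy-only sketch with invented names:

```python
import numpy as np

def merge_categories(categories, threshold=0.9):
    """categories: list of (n_i, dim) arrays, one per DBSCAN cluster.
    Repeatedly compute centroids and merge any two clusters whose
    centroids have cosine similarity above threshold."""
    cats = [np.asarray(c, dtype=float) for c in categories]
    merged = True
    while merged:
        merged = False
        cents = [c.mean(axis=0) for c in cats]
        cents = [v / np.linalg.norm(v) for v in cents]
        for a in range(len(cats)):
            for b in range(a + 1, len(cats)):
                if float(cents[a] @ cents[b]) > threshold:
                    cats[a] = np.vstack([cats[a], cats[b]])
                    del cats[b]
                    merged = True   # re-compute centroids and rescan
                    break
            if merged:
                break
    return cats

clusters = [np.array([[1.0, 0.0], [1.0, 0.1]]),
            np.array([[0.99, 0.05]]),
            np.array([[0.0, 1.0]])]
result = merge_categories(clusters)
```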
Next, hot spots of the corpus data are extracted according to the target roberta model and the plurality of categories. Specifically, the text with the highest combined semantic score and keyword score in each of the plurality of categories can be extracted as a hot spot of the corpus data by a method combining the two scores. This keeps the semantic features of the texts from being ignored during hot spot extraction and improves its accuracy. Moreover, displaying a representative text instead of keywords expresses the meaning of each category more intuitively.
Specifically, extracting the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data may proceed as follows: first, average the feature vectors of the texts contained in each category to obtain the centroid of the category; then calculate the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text; then segment the texts contained in each category into words; then extract the keywords of each category by the tf-idf method; then sort them by importance and select the top-n keywords; then obtain a keyword-count feature from the number of keywords each text in the category contains; then divide the keyword-count feature by n to obtain the keyword score of each text; then select the text with the highest sum of semantic score and keyword score in each category as the template of that category; then sort the templates of the plurality of categories by the number of texts each category contains; and finally select the templates corresponding to the first h categories as the hot spots of the corpus data. In the prior art, the top-n hot spots are displayed as keywords, whose category meaning cannot be understood intuitively, and inaccurate keyword extraction greatly harms the display effect. In the invention, the templates of the categories containing the most texts are selected as the hot-spot templates, improving the accuracy of template extraction.
Here n, the number of keywords selected by importance, can take values such as 3, 4, 5 or 6. Combining the semantic and keyword evaluation indexes improves the accuracy of the extracted templates.
The templates of the plurality of categories are sorted by the number of texts each category contains: the more texts a category contains, the nearer the front its template ranks; the fewer texts, the nearer the back. The value of h, the number of leading categories whose templates are selected, can be 2, 3, 4, 5, 6, or the like.
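The combined scoring for a single category can be sketched as follows — an illustrative sketch with invented names, assuming the category's top-n keywords have already been extracted (e.g. by tf-idf):

```python
import numpy as np

def pick_template(texts, vectors, keywords, n):
    """Semantic score: cosine similarity of each text to the category
    centroid.  Keyword score: (number of the category's top-n keywords
    the text contains) / n.  The text with the highest sum of the two
    scores becomes the category's template."""
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    semantic = unit @ centroid
    kw_score = np.array([sum(k in t for k in keywords) / n for t in texts])
    return texts[int(np.argmax(semantic + kw_score))]

texts = ["big match win", "weather report", "big win today"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
template = pick_template(texts, vecs, keywords=["big", "win"], n=2)
```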
After the roberta model in the general field is pre-trained by utilizing the corpus data of the hot spot to be extracted, the roberta model can better capture unique information in the corpus data of the hot spot to be extracted. And the situation of unk (unbown) of the words in the corpus data of the hot spot to be extracted can be reduced by pre-training, and a foundation is laid for the subsequent hot spot for more accurately extracting the corpus data. In addition, the parameter of the roberta model is adjusted in a twin network mode, the similarity distance can be directly optimized, similar texts are more compact in distance, and dissimilar texts are more dispersed in distance, so that the accuracy of subsequent clustering is improved, and the accuracy of subsequent hot spots in the corpus data is improved.
In addition, an embodiment of the present invention further provides a storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute any of the hot spot extraction methods described above. For the effects, refer to the description above, which is not repeated here.
In addition, an embodiment of the present invention further provides a server, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute any of the hot spot extraction methods described above by calling the computer program stored in the memory. For the effects, refer to the description above, which is not repeated here.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A hotspot extraction method is characterized by comprising the following steps:
obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
pre-training a roberta model in the general field according to the corpus data to obtain a roberta model in the professional field;
extracting a feature vector of each text in the plurality of texts according to the roberta model of the professional field;
constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
adjusting parameters of the roberta model in the professional field by means of a twin network according to the training samples to obtain a target roberta model;
extracting a feature vector of each text in the plurality of texts according to the target roberta model;
clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
and extracting hot spots of the corpus data according to the target roberta model and the plurality of categories.
2. The hot spot extraction method of claim 1, wherein extracting the feature vector of each text in the plurality of texts according to the roberta model in the professional field specifically comprises:
taking the last set number of layers of the roberta model in the professional field, and averaging the feature vectors of each word within each of those layers;
and summing the per-layer feature vectors and taking their average to obtain the feature vector of each text.
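The two-stage pooling described in claim 2 can be sketched in plain Python. This is a minimal illustration: the layer count, token count, and vector values are made up, and in practice a real roberta model (with hidden states exposed) would supply the per-layer token vectors.

```python
def text_vector(layer_token_vectors):
    """Pool a text's feature vector as in claim 2: average the token
    vectors within each of the last k layers, then average those
    per-layer vectors. `layer_token_vectors` is a list (one entry per
    selected layer) of lists of token vectors."""
    layer_means = []
    for tokens in layer_token_vectors:
        dim = len(tokens[0])
        layer_means.append(
            [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        )
    dim = len(layer_means[0])
    return [sum(m[d] for m in layer_means) / len(layer_means) for d in range(dim)]

# Toy example: 2 layers, 2 tokens per layer, 2-dimensional vectors.
layers = [
    [[1.0, 0.0], [3.0, 2.0]],  # second-to-last layer: token vectors
    [[0.0, 4.0], [2.0, 2.0]],  # last layer: token vectors
]
vec = text_vector(layers)  # per-layer means are [2, 1] and [1, 3]
```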
3. The method of claim 1, wherein constructing the training samples of the twin network from the feature vector of each text in the plurality of texts specifically comprises:
forming a plurality of text pairs from each text in the plurality of texts and every other text, wherein each text pair comprises two texts;
calculating the similarity between the feature vectors of the two texts in each text pair;
sorting the plurality of text pairs by similarity from high to low;
selecting the text pairs whose similarity falls within the top first proportion as positive samples of the training samples;
and selecting the text pairs whose similarity falls within the bottom second proportion as negative samples of the training samples.
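The pair construction in claim 3 (combined with claim 4's cosine similarity) can be sketched as follows. The proportion values and the toy vectors are illustrative placeholders; the patent leaves the actual proportions unspecified.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_pairs(vectors, pos_ratio=0.2, neg_ratio=0.2):
    """Form all text pairs, sort them by cosine similarity from high to
    low, and take the top `pos_ratio` share as positive samples and the
    bottom `neg_ratio` share as negative samples (ratios are assumed)."""
    pairs = sorted(
        combinations(range(len(vectors)), 2),
        key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
        reverse=True,
    )
    n_pos = max(1, int(len(pairs) * pos_ratio))
    n_neg = max(1, int(len(pairs) * neg_ratio))
    return pairs[:n_pos], pairs[-n_neg:]

# Toy corpus of three feature vectors: the first two are near-parallel.
pos, neg = build_pairs([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```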
4. The method for extracting hot spots according to claim 3, wherein the calculating the similarity between the feature vectors of the two texts in each text pair specifically comprises:
and calculating the cosine similarity between the feature vectors of the two texts in each text pair.
5. The hot spot extraction method as claimed in claim 4, wherein adjusting parameters of the roberta model in the professional field by means of the twin network according to the training samples to obtain the target roberta model specifically comprises:
constructing two parameter-sharing roberta models from the roberta model in the professional field;
inputting the two texts of each text pair in the positive and negative samples of the training samples into the two parameter-sharing roberta models respectively, and outputting the feature vectors of the two texts;
calculating the cosine similarity between the feature vectors of the two texts;
and supervising and optimizing the parameters of the two parameter-sharing roberta models through a contrastive loss function to obtain the target roberta model.
6. The hotspot extraction method of claim 5, wherein the contrastive loss function is:
L = y·d² + (1 − y)·max(margin − d, 0)²
wherein y represents the label value, which is 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts in the text pair of the positive or negative sample;
and margin represents the set cosine distance interval.
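A direct reading of claim 6's contrastive loss, with the labels and distance exactly as defined in the claim; the default margin value here is an arbitrary placeholder, since the patent only calls it a "set" interval.

```python
def contrastive_loss(y, d, margin=0.5):
    """Contrastive loss for one text pair:
    y      -- 1 for a positive pair, 0 for a negative pair;
    d      -- cosine distance between the two feature vectors;
    margin -- the set cosine distance interval (placeholder default).
    Positive pairs are penalized for being far apart; negative pairs
    are penalized only while they are closer than the margin."""
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
```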
7. The hot spot extraction method of claim 1, wherein the clustering algorithm is a density-based DBSCAN clustering algorithm.
8. The method according to claim 7, wherein clustering the plurality of texts of the corpus data by using the clustering algorithm to obtain the plurality of categories specifically comprises:
Step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
Step 2: deleting outliers to obtain a clustering result with a plurality of different categories;
Step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category;
Step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result, and merging two categories if the cosine similarity between their centroids is greater than a set threshold;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is not greater than the set threshold, and outputting the plurality of categories.
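The iterative centroid-merging of steps 3-4 can be sketched in plain Python. This is a minimal illustration of the merge loop only (the initial DBSCAN pass and outlier deletion are assumed to have already produced the input categories, and the threshold value is made up):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def merge_categories(categories, threshold=0.9):
    """Repeat steps 3-4: recompute each category's centroid, merge any
    two categories whose centroid cosine similarity exceeds `threshold`,
    and stop once no pair exceeds it. `categories` is a list of lists of
    feature vectors; the threshold is an illustrative placeholder."""
    cats = [list(c) for c in categories]
    merged = True
    while merged:
        merged = False
        cents = [centroid(c) for c in cats]
        for i in range(len(cats)):
            for j in range(i + 1, len(cats)):
                if cosine(cents[i], cents[j]) > threshold:
                    cats[i].extend(cats.pop(j))  # merge j into i
                    merged = True
                    break
            if merged:
                break  # centroids are stale; recompute and rescan
    return cats

# The first two one-text categories are near-parallel and get merged.
result = merge_categories([[[1.0, 0.0]], [[0.98, 0.02]], [[0.0, 1.0]]])
```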
9. The method of claim 1, wherein extracting hot spots of the corpus data according to the target roberta model and the plurality of categories specifically comprises:
extracting, by a method combining the semantic score and the keyword score, the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data.
10. The hotspot extraction method according to claim 9, wherein extracting, by the method combining the semantic score and the keyword score, the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data specifically comprises:
averaging the feature vectors of the texts contained in each category to obtain the centroid of each category;
calculating the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text in the category;
performing word segmentation on texts contained in each category;
extracting keywords of each category by a tf-idf method;
sorting the keywords by importance, and selecting the keywords whose importance ranks in the top n;
obtaining a keyword quantity feature according to the number of the selected keywords contained in each text in the category;
dividing the keyword quantity feature by n to obtain the keyword score of each text in the category;
selecting the text with the highest sum of the semantic score and the keyword score in each category as the template of that category;
sorting the templates of the plurality of categories according to the number of texts contained in each category;
and selecting the templates corresponding to the first h categories as the hot spots of the corpus data.
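Within one category, claim 10's combined scoring can be sketched as follows. The texts, vectors, keywords, and n here are illustrative stand-ins; in the method itself the keywords would come from tf-idf over the category's segmented texts.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pick_template(texts, vectors, keywords, n):
    """Claim 10's scoring within one category: the semantic score is the
    cosine similarity from the category centroid to each text's vector;
    the keyword score is (number of the top-n keywords the text
    contains) / n. The text with the highest combined score becomes the
    category template."""
    dim = len(vectors[0])
    cent = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    best, best_score = None, float("-inf")
    for text, vec in zip(texts, vectors):
        score = cosine(cent, vec) + sum(1 for k in keywords if k in text) / n
        if score > best_score:
            best, best_score = text, score
    return best

# Toy category: the first text matches both keywords and wins.
template = pick_template(
    ["network outage in city", "my bill is wrong"],
    [[1.0, 0.0], [0.9, 0.4]],
    ["network", "outage"],
    n=2,
)
```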
11. A storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the hotspot extraction method according to any one of claims 1 to 10.
12. A server, characterized by comprising a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the hotspot extracting method according to any one of claims 1-10 by calling the computer program stored in the memory.
CN202010950134.5A 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server Pending CN112131463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010950134.5A CN112131463A (en) 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server

Publications (1)

Publication Number Publication Date
CN112131463A true CN112131463A (en) 2020-12-25

Family

ID=73846575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010950134.5A Pending CN112131463A (en) 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server

Country Status (1)

Country Link
CN (1) CN112131463A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571405A (en) * 2010-12-31 2012-07-11 中国移动通信集团设计院有限公司 Method and device for acquiring resource information
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
WO2018086470A1 (en) * 2016-11-10 2018-05-17 腾讯科技(深圳)有限公司 Keyword extraction method and device, and server
CN110347838A (en) * 2019-07-17 2019-10-18 成都医云科技有限公司 Model training method and device are examined by Xian Shang department point
CN111198946A (en) * 2019-12-25 2020-05-26 北京邮电大学 Network news hotspot mining method and device
CN111310041A (en) * 2020-02-12 2020-06-19 腾讯科技(深圳)有限公司 Image-text publishing method, model training method and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination