CN112131463A - Hot spot extraction method, storage medium and server - Google Patents

Hot spot extraction method, storage medium and server Download PDF

Info

Publication number
CN112131463A
CN112131463A (application CN202010950134.5A)
Authority
CN
China
Prior art keywords
text
texts
categories
category
roberta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010950134.5A
Other languages
Chinese (zh)
Inventor
江永渡
邵陈杰
赵志武
程德生
厉屹
林镇杰
钱刚
朱文
章冬红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Soft Hangzhou Anren Network Communication Co ltd
Original Assignee
China Soft Hangzhou Anren Network Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Soft Hangzhou Anren Network Communication Co ltd filed Critical China Soft Hangzhou Anren Network Communication Co ltd
Priority to CN202010950134.5A priority Critical patent/CN112131463A/en
Publication of CN112131463A publication Critical patent/CN112131463A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a hot spot extraction method, a storage medium and a server. The hot spot extraction method comprises the following steps: obtaining corpus data of hot spots to be extracted; pre-training a roberta model in the general field on the corpus data to obtain a roberta model in the professional field; extracting a feature vector of each of a plurality of texts with the professional-field roberta model; constructing training samples for a twin network from the feature vector of each of the plurality of texts; adjusting the parameters of the professional-field roberta model by means of the twin network according to the training samples to obtain a target roberta model; extracting a feature vector of each text with the target roberta model; clustering the plurality of texts of the corpus data with a clustering algorithm to obtain a plurality of categories; and extracting hot spots of the corpus data according to the target roberta model and the plurality of categories. The method better captures information unique to the corpus data, reduces the occurrence of unk (unknown) tokens for words in the corpus data, and improves the accuracy of clustering and hot spot extraction.

Description

Hot spot extraction method, storage medium and server
Technical Field
The invention relates to the technical field of natural language processing, in particular to a hot spot extraction method, a storage medium and a server.
Background
With the development of network information technology, text data is growing rapidly. Automatically processing this information by computer to mine hot topics in a timely and accurate manner is of great significance for understanding the latest public-opinion hotspots, studying sudden (breaking) hotspots, and grasping the trend of future hot topics. Current methods for automatically extracting hot spots generally apply a clustering algorithm to obtain categories and then extract keywords from each category as the final hot spot result. Limited by the clustering effect, the returned top-n hot spots often include categories with the same meaning, so the hot spot extraction effect is not ideal.
Disclosure of Invention
The invention provides a hot spot extraction method, a storage medium and a server, which are used for improving the accuracy of hot spot extraction.
In a first aspect, the present invention provides a hot spot extraction method, including:
obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field;
extracting a feature vector of each text in a plurality of texts according to a roberta model in the professional field;
constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
adjusting parameters of a roberta model in the professional field in a twin network mode according to the training sample to obtain a roberta model of the target;
extracting a feature vector of each text in the plurality of texts according to the roberta model of the target;
clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
and extracting hot spots of the corpus data according to the roberta model of the target and a plurality of categories.
In this scheme, after the general-field roberta model is pre-trained on the corpus data of the hot spots to be extracted, the model can better capture information unique to that corpus data. Pre-training also reduces the occurrence of unk (unknown) tokens for words in the corpus data, laying a foundation for subsequently extracting the hot spots of the corpus data more accurately. In addition, adjusting the parameters of the roberta model by means of a twin network (i.e., a Siamese network) directly optimizes the similarity distance, drawing similar texts closer together and pushing dissimilar texts further apart, thereby improving the accuracy of the subsequent clustering and hence of the extracted hot spots.
In a specific embodiment, according to the roberta model in the professional field, extracting a feature vector of each text in a plurality of texts specifically comprises: taking the last set number layers of the roberta model in the professional field, and averaging the feature vectors of each character of each layer in the set number layers; and adding the feature vectors of the set number layers, and taking the average value of the feature vectors to obtain the feature vector of each text. By fusing the feature vectors of the last layers, the feature representation of the obtained text can be improved, the feature vectors of the text have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved.
In a specific embodiment, constructing the training samples of the twin network according to the feature vector of each of the plurality of texts comprises: pairing each text with every other text to form a plurality of text pairs, each pair comprising two texts; calculating the similarity between the feature vectors of the two texts in each pair; sorting the text pairs by similarity from high to low; selecting the pairs whose similarity falls within the top first proportion as positive samples of the training samples; and selecting the pairs whose similarity falls within the bottom second proportion as negative samples of the training samples.
In a specific embodiment, calculating the similarity between the feature vectors of the two texts in each text pair specifically includes: and calculating the cosine similarity between the feature vectors of the two texts in each text pair.
In a specific embodiment, adjusting the parameters of the roberta model in the professional field by means of the twin network according to the training samples to obtain the target roberta model comprises: constructing two parameter-sharing roberta models from the roberta model in the professional field; inputting the two texts of each text pair among the positive and negative samples into the two parameter-sharing roberta models respectively, and outputting the feature vectors of the two texts; calculating the cosine similarity between the two feature vectors; and supervising and optimizing the parameters of the parameter-sharing roberta models through a contrastive loss function to obtain the target roberta model. This adjusts the parameters of the roberta model in the professional field.
In one specific embodiment, the contrastive loss function is:
loss = y × d² + (1 − y) × max(margin − d, 0)²
wherein y is the label value, 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts of the positive or negative sample pair;
margin represents the set cosine-distance interval. This loss is used to fine-tune the parameters of the roberta model in the professional field.
In a specific embodiment, the clustering algorithm is a density-based DBSCAN clustering algorithm. The number of categories does not need to be determined, the categories can be combined in the subsequent processing operation, and the clustering effect is improved.
In a specific embodiment, clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories specifically includes:
step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
step 2: deleting outliers to obtain clustering results with a plurality of different categories;
step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category of the clustering result;
step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result; if the cosine similarity between the centroids of two categories is larger than a set threshold, merging the two categories;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is no longer greater than the set threshold, and outputting the resulting plurality of categories. This reduces the similarity between different categories and improves the clustering effect.
In a specific embodiment, extracting the hot spots of the corpus data according to the target roberta model and the plurality of categories is specifically: extracting, by a method combining a semantic score and a keyword score, the text with the highest combined semantic and keyword score in each of the plurality of categories as a hot spot of the corpus data. Displaying a representative text instead of keywords expresses the meaning of each category more intuitively.
In a specific embodiment, according to a method of combining a semantic score and a keyword score, extracting a text with the highest semantic score and the highest keyword score in each of a plurality of categories as a hotspot of corpus data specifically includes:
averaging the feature vectors of the texts contained in each category to obtain the centroid of each category;
calculating the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text in the category;
performing word segmentation on texts contained in each category;
extracting keywords of each category by a tf-idf method;
sorting according to importance, and selecting keywords with the top n importance;
obtaining keyword quantity characteristics according to the quantity of keywords contained in each text in the category;
dividing the number characteristics of the keywords by n to obtain the keyword score of each text in the category;
selecting a text with the highest sum score of the semantic score and the keyword score in each category as a template of the category;
sequencing the templates of the multiple categories according to the number of the text pieces contained in each category;
and selecting the template corresponding to the first h categories as the hot spot of the corpus data.
Selecting as hot-spot templates the templates of the categories whose text counts rank among the first few improves the accuracy of template extraction.
In a second aspect, the present invention also provides a storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute any one of the hot spot extraction methods described above.
In a third aspect, the present invention further provides a server, where the server includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute any one of the hot spot extraction methods by calling the computer program stored in the memory.
Drawings
Fig. 1 is a flowchart of a hot spot extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of adjusting parameters of a roberta model in a professional field by a twin network according to a training sample according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For convenience of understanding the hot spot extraction method provided by the embodiment of the present invention, an application scenario of the hot spot extraction method provided by the embodiment of the present invention is first described below, where the hot spot extraction method is used to extract a hot spot in corpus data. The hot spot extraction method is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, the hot spot extraction method provided by the embodiment of the present invention includes:
s10: obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
s20: pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field;
s30: extracting a feature vector of each text in a plurality of texts according to a roberta model in the professional field;
s40: constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
s50: adjusting parameters of a roberta model in the professional field in a twin network mode according to the training sample to obtain a roberta model of the target;
s60: extracting a feature vector of each text in the plurality of texts according to the roberta model of the target;
s70: clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
s80: and extracting hot spots of the corpus data according to the roberta model of the target and a plurality of categories.
In this scheme, after the general-field roberta model is pre-trained on the corpus data of the hot spots to be extracted, the model can better capture information unique to that corpus data. Pre-training also reduces the occurrence of unk (unknown) tokens for words in the corpus data, laying a foundation for subsequently extracting the hot spots of the corpus data more accurately. In addition, adjusting the parameters of the roberta model by means of the twin network directly optimizes the similarity distance, drawing similar texts closer together and pushing dissimilar texts further apart, thereby improving the accuracy of the subsequent clustering and hence of the extracted hot spots. Each of the above steps is described in detail below with reference to the accompanying drawings.
Firstly, corpus data of a hot spot to be extracted is obtained, wherein the corpus data comprises a plurality of texts. The corpus data may be a piece of sports news, financial news, military news, social news, entertainment news, historical news, and the like.
And then, pre-training the roberta model in the general field according to the corpus data to obtain the roberta model in the professional field. The roberta model in the general field can be a general roberta model trained on databases such as Chinese Wikipedia, Baidu encyclopedia, Sina and microblog. The method for pre-training the roberta model in the general field to obtain the roberta model in the professional field by specifically adopting the corpus data of the hot spot to be extracted is a pre-training method in the prior art. The general roberta model can perform character-level cutting according to the vocab.txt file, if words which are not contained in the vocab.txt exist in the corpus data of the hot spot to be extracted, the words are added into the vocab.txt, and the embedding layer of the general roberta model is expanded according to the length of the vocab.txt.
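As a minimal, hypothetical illustration of the vocabulary-extension step just described (the function name and data are invented; a real implementation would also expand the model's embedding layer to the new vocabulary length):

```python
def extend_vocab(vocab, corpus_texts):
    """Append characters that occur in the corpus but are missing from the
    character-level vocabulary, so they no longer map to the unk token."""
    known = set(vocab)
    new_chars = []
    for text in corpus_texts:
        for ch in text:
            if ch not in known:
                known.add(ch)
                new_chars.append(ch)
    # The embedding layer would then be resized to len(vocab) + len(new_chars).
    return vocab + new_chars

vocab = ["[PAD]", "[UNK]", "我", "们"]
extended = extend_vocab(vocab, ["我们赢了"])
```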
Next, a feature vector of each of the plurality of texts is extracted according to a roberta model of the professional field. During specific extraction, the last set number layers of the roberta model in the professional field can be taken firstly, and the feature vector of each character of each layer in the set number layers is averaged; and then adding the feature vectors of the set number layers, and taking the average value of the feature vectors to obtain the feature vector of each text. By fusing the feature vectors of the last layers, the feature representation of the obtained text can be improved, the feature vectors of the text have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved. Specifically, when the number of the last set number layers is determined, the last set number layers may be the last 2 layers, the last 3 layers, the last 4 layers, the last 5 layers, and the like. The following formula can be used for calculation:
V = (1/k) × Σ_{i=1..k} (1/m) × Σ_{j=1..m} w_{ij}
wherein V represents the output feature vector of the text;
i indexes a layer of the roberta model in the professional field, ranging from 1 to k, the set number of last layers (here k = 4);
j indexes a character input into the roberta model, ranging from 1 to m, where m is the length of the text;
w_{ij} represents the feature vector of the j-th character at the i-th layer.
The roberta model is a 12-layer transformer model, and the extracted feature vector of each word contains semantic information. The feature vectors of the sentences (namely each text) are obtained in an averaging mode, the semantic information of the sentences can be obtained, the features of the last layers are slightly different, the obtained feature representation can be improved by fusing the features of the last layers, the obtained feature representation can have complete semantic information, and the subsequent clustering effect and the accuracy of hot spot extraction are improved.
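The character- and layer-averaging described above can be sketched as follows, assuming the per-layer hidden states for one text are already available as a numpy array of shape (num_layers, seq_len, dim); the function name and array layout are illustrative assumptions, not the patent's code:

```python
import numpy as np

def text_vector(hidden_states, k=4):
    """Average the per-character vectors within each of the last k layers,
    then average those k layer vectors to get one feature vector per text."""
    last = hidden_states[-k:]          # (k, seq_len, dim) -- last k layers
    per_layer = last.mean(axis=1)      # average over the m characters
    return per_layer.mean(axis=0)      # average over the k layers

# A 12-layer model, a 5-character text, 8-dimensional features:
h = np.ones((12, 5, 8))
v = text_vector(h)
```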
Next, the training samples of the twin network are constructed according to the feature vector of each of the plurality of texts. Specifically, each text is paired with every other text to form a plurality of text pairs, each pair comprising two texts; the similarity between the feature vectors of the two texts in each pair is calculated; the text pairs are sorted by similarity from high to low; the pairs whose similarity falls within the top first proportion are selected as positive samples, and the pairs whose similarity falls within the bottom second proportion are selected as negative samples. The first proportion can be 5%, 10%, 15%, 20%, etc., and likewise the second proportion; the two proportions may be equal or unequal. Constructing positive and negative samples in this way improves the accuracy of the subsequent fine-tuning of the roberta model's parameters.
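An illustrative numpy-only sketch of this pair-construction step (names and ratio defaults are invented):

```python
import numpy as np

def build_pairs(vectors, pos_ratio=0.1, neg_ratio=0.1):
    """Form all text pairs, rank them by cosine similarity, and take the
    top pos_ratio of pairs as positive samples and the bottom neg_ratio
    as negative samples for the twin network."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            pairs.append((i, j, float(unit[i] @ unit[j])))
    pairs.sort(key=lambda p: p[2], reverse=True)   # high to low similarity
    k_pos = max(1, int(len(pairs) * pos_ratio))
    k_neg = max(1, int(len(pairs) * neg_ratio))
    return pairs[:k_pos], pairs[-k_neg:]

vecs = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
positives, negatives = build_pairs(vecs)
```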
In addition, when the similarity between the feature vectors of the two texts in each text pair is calculated, the cosine similarity between the feature vectors of the two texts in each text pair can be calculated, and the similarity expressed by adopting the Euclidean distance between the feature vectors of the two texts can also be calculated; the similarity between the feature vectors of two texts, which is expressed by using the manhattan distance, can also be calculated.
Next, the parameters of the roberta model in the professional field are adjusted by means of the twin network according to the training samples to obtain the target roberta model. Referring to fig. 2, to adjust the parameters, two parameter-sharing roberta models may be constructed from the roberta model in the professional field; the two texts of each text pair among the positive and negative samples are then input into the two parameter-sharing roberta models respectively, and the feature vectors of the two texts are output; the cosine similarity between the two feature vectors is calculated; and the parameters of the parameter-sharing models are supervised and optimized through a contrastive loss function to obtain the target roberta model. In this way, the parameters of the roberta model in the professional field can be adjusted conveniently.
Specifically, the contrastive loss function may be:
loss = y × d² + (1 − y) × max(margin − d, 0)²
wherein y is the label value, 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts of the positive or negative sample pair;
margin denotes the set cosine-distance interval and may be set to 0.75, 0.80, 0.85, 0.90, or the like, so as to fine-tune the parameters of the roberta model in the professional field. The learning rate may be set to 5e-6 for this fine-tuning.
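A direct numpy rendering of this loss (the standard contrastive loss, matching the variables y, d and margin defined above; illustrative only):

```python
import numpy as np

def contrastive_loss(y, d, margin=0.8):
    """y: 1 for a positive pair, 0 for a negative pair; d: cosine distance.
    Positive pairs are pulled together (loss = d^2); negative pairs are
    pushed apart until their distance exceeds margin."""
    return y * d ** 2 + (1 - y) * np.maximum(margin - d, 0.0) ** 2
```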
Next, feature vectors of each of the plurality of texts are extracted according to the roberta model of the target. Because the roberta model of the target adjusts the parameters of the roberta model in a twin network mode, the similarity distance can be directly optimized, similar texts are more compact in distance, and dissimilar texts are more dispersed in distance, so that the accuracy of subsequent clustering is improved, and the accuracy of hot spots in subsequent extraction of corpus data is improved.
Next, the plurality of texts of the corpus data are clustered by using a clustering algorithm to obtain a plurality of categories. The clustering algorithm may be a density-based DBSCAN clustering algorithm. DBSCAN does not require the number of categories to be determined in advance, so categories can be merged in the subsequent processing operations, improving the clustering effect.
Specifically, clustering the plurality of texts of the corpus data by using the clustering algorithm to obtain the plurality of categories can proceed as follows:
step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
step 2: deleting outliers to obtain clustering results with a plurality of different categories;
step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category of the clustering result;
step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result; if the cosine similarity between the centroids of two categories is larger than a set threshold, merging the two categories;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is no longer greater than the set threshold, and outputting the resulting plurality of categories. This reduces the similarity between different categories and improves the clustering effect.
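Steps 3-4 above (run after DBSCAN has produced the initial clusters and outliers have been removed) can be sketched as the following merging loop — an illustrative numpy-only sketch with invented names:

```python
import numpy as np

def merge_categories(categories, threshold=0.9):
    """categories: list of (n_i, dim) arrays, one per DBSCAN cluster.
    Repeatedly compute centroids and merge any two clusters whose
    centroids have cosine similarity above threshold."""
    cats = [np.asarray(c, dtype=float) for c in categories]
    merged = True
    while merged:
        merged = False
        cents = [c.mean(axis=0) for c in cats]
        cents = [v / np.linalg.norm(v) for v in cents]
        for a in range(len(cats)):
            for b in range(a + 1, len(cats)):
                if float(cents[a] @ cents[b]) > threshold:
                    cats[a] = np.vstack([cats[a], cats[b]])
                    del cats[b]
                    merged = True   # re-compute centroids and rescan
                    break
            if merged:
                break
    return cats

clusters = [np.array([[1.0, 0.0], [1.0, 0.1]]),
            np.array([[0.99, 0.05]]),
            np.array([[0.0, 1.0]])]
result = merge_categories(clusters)
```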
Next, hot spots of the corpus data are extracted according to the target roberta model and the plurality of categories. Specifically, the text with the highest combined semantic score and keyword score in each of the plurality of categories can be extracted as a hot spot of the corpus data by a method combining the two scores. This keeps the semantic features of the texts from being ignored during hot spot extraction and improves its accuracy. Moreover, displaying a representative text instead of keywords expresses the meaning of each category more intuitively.
Specifically, extracting the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data may proceed as follows: first, average the feature vectors of the texts contained in each category to obtain the centroid of the category; then calculate the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text; then segment the texts contained in each category into words; then extract the keywords of each category by the tf-idf method; then sort them by importance and select the top-n keywords; then obtain a keyword-count feature from the number of keywords each text in the category contains; then divide the keyword-count feature by n to obtain the keyword score of each text; then select the text with the highest sum of semantic score and keyword score in each category as the template of that category; then sort the templates of the plurality of categories by the number of texts each category contains; and finally select the templates corresponding to the first h categories as the hot spots of the corpus data. In the prior art, the top-n hot spots are displayed as keywords, whose category meaning cannot be understood intuitively, and inaccurate keyword extraction greatly harms the display effect. In the invention, the templates of the categories containing the most texts are selected as the hot-spot templates, improving the accuracy of template extraction.
Here n, the number of keywords selected by importance, can take values such as 3, 4, 5 or 6. Combining the semantic and keyword evaluation indexes improves the accuracy of the extracted templates.
The templates of the plurality of categories are sorted by the number of texts each category contains: the more texts a category contains, the nearer the front its template ranks; the fewer texts, the nearer the back. The value of h, the number of leading categories whose templates are selected, can be 2, 3, 4, 5, 6, or the like.
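The combined scoring for a single category can be sketched as follows — an illustrative sketch with invented names, assuming the category's top-n keywords have already been extracted (e.g. by tf-idf):

```python
import numpy as np

def pick_template(texts, vectors, keywords, n):
    """Semantic score: cosine similarity of each text to the category
    centroid.  Keyword score: (number of the category's top-n keywords
    the text contains) / n.  The text with the highest sum of the two
    scores becomes the category's template."""
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    semantic = unit @ centroid
    kw_score = np.array([sum(k in t for k in keywords) / n for t in texts])
    return texts[int(np.argmax(semantic + kw_score))]

texts = ["big match win", "weather report", "big win today"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
template = pick_template(texts, vecs, keywords=["big", "win"], n=2)
```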
After the roberta model in the general field is pre-trained by utilizing the corpus data of the hot spot to be extracted, the roberta model can better capture unique information in the corpus data of the hot spot to be extracted. And the situation of unk (unbown) of the words in the corpus data of the hot spot to be extracted can be reduced by pre-training, and a foundation is laid for the subsequent hot spot for more accurately extracting the corpus data. In addition, the parameter of the roberta model is adjusted in a twin network mode, the similarity distance can be directly optimized, similar texts are more compact in distance, and dissimilar texts are more dispersed in distance, so that the accuracy of subsequent clustering is improved, and the accuracy of subsequent hot spots in the corpus data is improved.
In addition, an embodiment of the present invention further provides a storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute any of the hot spot extraction methods described above. For the effects, refer to the description above, which is not repeated here.
In addition, an embodiment of the present invention further provides a server, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute any of the hot spot extraction methods described above by calling the computer program stored in the memory. For the effects, refer to the description above, which is not repeated here.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A hotspot extraction method is characterized by comprising the following steps:
obtaining corpus data of a hot spot to be extracted, wherein the corpus data comprises a plurality of texts;
pre-training a roberta model in the general field according to the corpus data to obtain a roberta model in the professional field;
extracting a feature vector of each text in the plurality of texts according to the roberta model of the professional field;
constructing a training sample of the twin network according to the feature vector of each text in the plurality of texts;
adjusting parameters of the roberta model in the professional field by means of a twin network according to the training samples to obtain a target roberta model;
extracting a feature vector of each text in the plurality of texts according to the target roberta model;
clustering the plurality of texts of the corpus data by using a clustering algorithm to obtain a plurality of categories;
and extracting hot spots of the corpus data according to the target roberta model and the plurality of categories.
2. The hot spot extraction method of claim 1, wherein extracting the feature vector of each text in the plurality of texts according to the roberta model in the professional field specifically comprises:
taking the last set number of layers of the roberta model in the professional field, and averaging the feature vectors of each word within each of those layers;
and summing the per-layer feature vectors and taking their average to obtain the feature vector of each text.
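The two-stage pooling described in claim 2 can be sketched in plain Python. This is a minimal illustration: the layer count, token count, and vector values are made up, and in practice a real roberta model (with hidden states exposed) would supply the per-layer token vectors.

```python
def text_vector(layer_token_vectors):
    """Pool a text's feature vector as in claim 2: average the token
    vectors within each of the last k layers, then average those
    per-layer vectors. `layer_token_vectors` is a list (one entry per
    selected layer) of lists of token vectors."""
    layer_means = []
    for tokens in layer_token_vectors:
        dim = len(tokens[0])
        layer_means.append(
            [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        )
    dim = len(layer_means[0])
    return [sum(m[d] for m in layer_means) / len(layer_means) for d in range(dim)]

# Toy example: 2 layers, 2 tokens per layer, 2-dimensional vectors.
layers = [
    [[1.0, 0.0], [3.0, 2.0]],  # second-to-last layer: token vectors
    [[0.0, 4.0], [2.0, 2.0]],  # last layer: token vectors
]
vec = text_vector(layers)  # per-layer means are [2, 1] and [1, 3]
```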
3. The method of claim 1, wherein constructing the training samples of the twin network from the feature vector of each text in the plurality of texts specifically comprises:
forming a plurality of text pairs from each text in the plurality of texts and every other text, wherein each text pair comprises two texts;
calculating the similarity between the feature vectors of the two texts in each text pair;
sorting the plurality of text pairs by similarity from high to low;
selecting the text pairs whose similarity falls within the top first proportion as positive samples of the training samples;
and selecting the text pairs whose similarity falls within the bottom second proportion as negative samples of the training samples.
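The pair construction in claim 3 (combined with claim 4's cosine similarity) can be sketched as follows. The proportion values and the toy vectors are illustrative placeholders; the patent leaves the actual proportions unspecified.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_pairs(vectors, pos_ratio=0.2, neg_ratio=0.2):
    """Form all text pairs, sort them by cosine similarity from high to
    low, and take the top `pos_ratio` share as positive samples and the
    bottom `neg_ratio` share as negative samples (ratios are assumed)."""
    pairs = sorted(
        combinations(range(len(vectors)), 2),
        key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
        reverse=True,
    )
    n_pos = max(1, int(len(pairs) * pos_ratio))
    n_neg = max(1, int(len(pairs) * neg_ratio))
    return pairs[:n_pos], pairs[-n_neg:]

# Toy corpus of three feature vectors: the first two are near-parallel.
pos, neg = build_pairs([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```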
4. The method for extracting hot spots according to claim 3, wherein the calculating the similarity between the feature vectors of the two texts in each text pair specifically comprises:
and calculating the cosine similarity between the feature vectors of the two texts in each text pair.
5. The hot spot extraction method as claimed in claim 4, wherein adjusting parameters of the roberta model in the professional field by means of the twin network according to the training samples to obtain the target roberta model specifically comprises:
constructing two parameter-sharing roberta models from the roberta model in the professional field;
inputting the two texts of each text pair in the positive and negative samples of the training samples into the two parameter-sharing roberta models respectively, and outputting the feature vectors of the two texts;
calculating the cosine similarity between the feature vectors of the two texts;
and supervising and optimizing the parameters of the two parameter-sharing roberta models through a contrastive loss function to obtain the target roberta model.
6. The hotspot extraction method of claim 5, wherein the contrastive loss function is:
L = y·d² + (1 − y)·max(margin − d, 0)²
wherein y represents the label value, which is 1 for a positive sample and 0 for a negative sample;
d represents the cosine distance between the two texts in the text pair of the positive or negative sample;
and margin represents the set cosine distance interval.
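A direct reading of claim 6's contrastive loss, with the labels and distance exactly as defined in the claim; the default margin value here is an arbitrary placeholder, since the patent only calls it a "set" interval.

```python
def contrastive_loss(y, d, margin=0.5):
    """Contrastive loss for one text pair:
    y      -- 1 for a positive pair, 0 for a negative pair;
    d      -- cosine distance between the two feature vectors;
    margin -- the set cosine distance interval (placeholder default).
    Positive pairs are penalized for being far apart; negative pairs
    are penalized only while they are closer than the margin."""
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
```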
7. The hot spot extraction method of claim 1, wherein the clustering algorithm is a density-based DBSCAN clustering algorithm.
8. The method according to claim 7, wherein clustering the plurality of texts of the corpus data by using the clustering algorithm to obtain the plurality of categories specifically comprises:
Step 1: setting the N_Sample parameter of the DBSCAN clustering algorithm to 1, and clustering the plurality of texts of the corpus data;
Step 2: deleting outliers to obtain a clustering result with a plurality of different categories;
Step 3: averaging the feature vectors of the texts in each category of the clustering result to obtain the centroid of each category;
Step 4: calculating the cosine similarity between the centroids of any two different categories of the clustering result, and merging two categories if the cosine similarity between their centroids is greater than a set threshold;
and repeating steps 3-4 until the cosine similarity between the centroids of any two categories is not greater than the set threshold, and outputting the plurality of categories.
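The iterative centroid-merging of steps 3-4 can be sketched in plain Python. This is a minimal illustration of the merge loop only (the initial DBSCAN pass and outlier deletion are assumed to have already produced the input categories, and the threshold value is made up):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def merge_categories(categories, threshold=0.9):
    """Repeat steps 3-4: recompute each category's centroid, merge any
    two categories whose centroid cosine similarity exceeds `threshold`,
    and stop once no pair exceeds it. `categories` is a list of lists of
    feature vectors; the threshold is an illustrative placeholder."""
    cats = [list(c) for c in categories]
    merged = True
    while merged:
        merged = False
        cents = [centroid(c) for c in cats]
        for i in range(len(cats)):
            for j in range(i + 1, len(cats)):
                if cosine(cents[i], cents[j]) > threshold:
                    cats[i].extend(cats.pop(j))  # merge j into i
                    merged = True
                    break
            if merged:
                break  # centroids are stale; recompute and rescan
    return cats

# The first two one-text categories are near-parallel and get merged.
result = merge_categories([[[1.0, 0.0]], [[0.98, 0.02]], [[0.0, 1.0]]])
```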
9. The method of claim 1, wherein extracting hot spots of the corpus data according to the target roberta model and the plurality of categories specifically comprises:
extracting, by a method combining the semantic score and the keyword score, the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data.
10. The hotspot extraction method according to claim 9, wherein extracting, by the method combining the semantic score and the keyword score, the text with the highest semantic score and keyword score in each of the plurality of categories as a hot spot of the corpus data specifically comprises:
averaging the feature vectors of the texts contained in each category to obtain the centroid of each category;
calculating the cosine similarity from the centroid to each text in the category to obtain the semantic score of each text in the category;
performing word segmentation on texts contained in each category;
extracting keywords of each category by a tf-idf method;
sorting the keywords by importance, and selecting the keywords whose importance ranks in the top n;
obtaining a keyword quantity feature according to the number of the selected keywords contained in each text in the category;
dividing the keyword quantity feature by n to obtain the keyword score of each text in the category;
selecting the text with the highest sum of the semantic score and the keyword score in each category as the template of that category;
sorting the templates of the plurality of categories according to the number of texts contained in each category;
and selecting the templates corresponding to the first h categories as the hot spots of the corpus data.
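Within one category, claim 10's combined scoring can be sketched as follows. The texts, vectors, keywords, and n here are illustrative stand-ins; in the method itself the keywords would come from tf-idf over the category's segmented texts.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pick_template(texts, vectors, keywords, n):
    """Claim 10's scoring within one category: the semantic score is the
    cosine similarity from the category centroid to each text's vector;
    the keyword score is (number of the top-n keywords the text
    contains) / n. The text with the highest combined score becomes the
    category template."""
    dim = len(vectors[0])
    cent = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    best, best_score = None, float("-inf")
    for text, vec in zip(texts, vectors):
        score = cosine(cent, vec) + sum(1 for k in keywords if k in text) / n
        if score > best_score:
            best, best_score = text, score
    return best

# Toy category: the first text matches both keywords and wins.
template = pick_template(
    ["network outage in city", "my bill is wrong"],
    [[1.0, 0.0], [0.9, 0.4]],
    ["network", "outage"],
    n=2,
)
```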
11. A storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the hotspot extraction method according to any one of claims 1 to 10.
12. A server, characterized by comprising a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the hotspot extracting method according to any one of claims 1-10 by calling the computer program stored in the memory.
CN202010950134.5A 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server Pending CN112131463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010950134.5A CN112131463A (en) 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server

Publications (1)

Publication Number Publication Date
CN112131463A true CN112131463A (en) 2020-12-25

Family

ID=73846575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010950134.5A Pending CN112131463A (en) 2020-09-10 2020-09-10 Hot spot extraction method, storage medium and server

Country Status (1)

Country Link
CN (1) CN112131463A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571405A (en) * 2010-12-31 2012-07-11 中国移动通信集团设计院有限公司 Method and device for acquiring resource information
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
WO2018086470A1 (en) * 2016-11-10 2018-05-17 腾讯科技(深圳)有限公司 Keyword extraction method and device, and server
CN110347838A (en) * 2019-07-17 2019-10-18 成都医云科技有限公司 Model training method and device are examined by Xian Shang department point
CN111198946A (en) * 2019-12-25 2020-05-26 北京邮电大学 Network news hotspot mining method and device
CN111310041A (en) * 2020-02-12 2020-06-19 腾讯科技(深圳)有限公司 Image-text publishing method, model training method and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination