CN111723297B - Dual-semantic similarity judging method for grid society situation research and judgment
- Publication number
- CN111723297B (application number CN201911144452.6A)
- Authority
- CN
- China
- Prior art keywords
- feature vector
- similarity
- corpus
- model
- discrimination
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a dual semantic similarity judging method for grid society situation research and judgment, comprising the following steps: step 1) obtaining a training corpus; step 2) inputting the training corpus and extracting a feature vector a and a feature vector b of each corpus pair by using a BERT model; step 3) performing a preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics to generate an intermediate discrimination result; step 4) linearly combining the intermediate discrimination results; step 5) performing a secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2; step 6) tuning the parameters of the linear discrimination model with a Sigmoid function according to the secondary calculation result to generate the final similarity discrimination model; step 7) using the BERT model to extract a feature vector a and a feature vector b of the text big data to be discriminated, newly collected from multi-source webpages; step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and text feature vector b; step 9) storing the similarity discrimination result in HBase.
Description
Technical Field
The invention relates to a similarity judging method, in particular to a dual semantic similarity judging method for grid society situation research and judgment, and belongs to the technical field of big data public opinion analysis.
Background
Public opinion analysis is an important means of grid society situation research and judgment in the grid-based social management work of government commissions. It usually involves labeling text data from web pages and analyzing its similarity. Because similarity discrimination has long been limited by text feature extraction technology, its results have not been ideal, and discrimination methods have been continually optimized as feature extraction technology has advanced.
In existing similarity discrimination technology, a corpus is usually labeled manually to form training samples; the similarity is then calculated on these samples using the cosine of the angle between vectors or the Euclidean distance between vectors, and a similarity discrimination model is trained; finally, the trained model is used to discriminate the similarity of new text.
As this process shows, manual labeling is very costly for public opinion big data, and a similarity calculation model based on a single kind of semantics cannot always produce an accurate discrimination result. A new scheme is therefore urgently needed to solve these technical problems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a dual semantic similarity judging method for grid society situation research and judgment. The scheme analyzes both concrete semantics and abstract semantics, which solves the prior-art problem that single-semantic analysis applies poorly to public opinion data.
To achieve the above purpose, the technical scheme of the invention is a dual semantic similarity judging method for grid society situation research and judgment, characterized by comprising the following steps:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result into HBase.
As an improvement of the present invention, the step 1) specifically comprises the following steps:
step 1-1, acquiring text big data from multi-source webpages such as microblogs, major news websites, key forums and the like based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation results with manually set thresholds, and filtering out the corpus pairs in the public opinion corpus whose concrete-semantic and abstract-semantic similarity results clearly fall outside the threshold range;
in the step 1-4, the thresholds are set according to the results of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, manually labeling the similarity of the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus.
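The threshold filtering of steps 1-3 to 1-5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ScoredPair` type, the `filter_pairs` function, and the rule that a pair is kept only when both scores lie inside their ranges are assumptions, since the text does not say how the two scores are combined during filtering.

```python
from typing import List, NamedTuple, Tuple


class ScoredPair(NamedTuple):
    """A corpus pair with its two precomputed scores (hypothetical type)."""
    text_a: str
    text_b: str
    euclidean: float  # Euclidean distance between the two feature vectors
    cosine: float     # cosine of the angle between the two feature vectors


def filter_pairs(
    pairs: List[ScoredPair],
    euclid_range: Tuple[float, float],
    cosine_range: Tuple[float, float],
) -> List[ScoredPair]:
    """Keep only the pairs whose two scores both fall inside the given ranges.

    The ranges stand in for the manually set thresholds of step 1-4; the
    surviving pairs are the ones that would go on to manual labeling.
    """
    lo_e, hi_e = euclid_range
    lo_c, hi_c = cosine_range
    return [
        p for p in pairs
        if lo_e <= p.euclidean <= hi_e and lo_c <= p.cosine <= hi_c
    ]
```

In practice the ranges would be chosen by inspecting the scores of a sample of the initial corpus, as the step 1-4 note describes.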
As an improvement of the present invention, the similarity calculation model in the step 3) performs Euclidean distance calculation and cosine angle calculation on the input feature vector a and the feature vector b, respectively.
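The two measures named here can be computed as in the sketch below. The claims additionally mention a normalized value of the Euclidean distance without giving the formula, so the mapping `d / (1 + d)` used in `normalized_euclidean` is an assumption chosen only because it maps any distance into [0, 1). All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the distance into [0, 1); this normalization is an assumption."""
    d = euclidean_distance(a, b)
    return d / (1.0 + d)


def dual_scores(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return the pair (X1, X2): normalized distance and cosine."""
    return normalized_euclidean(a, b), cosine_similarity(a, b)
```

The distance is sensitive to vector magnitude while the cosine depends only on direction, which is one plausible reason for using both.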
As an improvement of the present invention, the step 4) linearly combines the intermediate discrimination results, specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine angle calculation result X2, and the result of the further calculation is expressed as a Boolean value, where 0 means similar and 1 means dissimilar.
As an improvement of the present invention, in the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.
As an improvement of the present invention, in the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through the Sigmoid function, generating the final similarity discrimination model.
As an improvement of the present invention, the values of c and d in the similarity discrimination model of the step 8) have been determined by the Sigmoid function; the value of c is 0.9 and the value of d is 0.1.
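The text states that c and d are determined by Sigmoid function training but does not spell out the procedure. One common reading is a logistic regression fitted by gradient descent, sketched below under that assumption. The bias term, the 0.5 cutoff, the learning rate, and the convention that both inputs grow as the pair becomes less similar (so that 1 means dissimilar) are all assumptions not found in the source; only the form cX1+dX2 and the values c = 0.9, d = 0.1 come from the text.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_score(x1: float, x2: float, c: float = 0.9, d: float = 0.1) -> float:
    """The linear discrimination model cX1 + dX2 with the reported weights."""
    return c * x1 + d * x2


def fit_weights(
    samples: Sequence[Tuple[float, float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 5000,
) -> Tuple[float, float, float]:
    """Fit c, d and a bias by batch gradient descent on the logistic loss.

    labels use the document's convention: 0 = similar, 1 = dissimilar.
    The bias is an assumption; the source model has no explicit bias.
    """
    c = d = bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_c = grad_d = grad_b = 0.0
        for (x1, x2), y in zip(samples, labels):
            err = sigmoid(c * x1 + d * x2 + bias) - y
            grad_c += err * x1
            grad_d += err * x2
            grad_b += err
        c -= lr * grad_c / n
        d -= lr * grad_d / n
        bias -= lr * grad_b / n
    return c, d, bias


def predict(x1: float, x2: float, c: float, d: float, bias: float,
            cutoff: float = 0.5) -> int:
    """Return 1 (dissimilar) or 0 (similar); the cutoff is an assumption."""
    return 1 if sigmoid(c * x1 + d * x2 + bias) >= cutoff else 0
```

Without a bias or a tuned cutoff, non-negative inputs and positive weights would always push the sigmoid output to 0.5 or above, which is why the sketch adds one.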
Compared with the prior art, the invention has the following technical effects:
1) the dual analysis of concrete semantics and abstract semantics solves the prior-art problem that single-semantic analysis applies poorly to public opinion data;
2) the secondary discrimination of the dual semantic similarity result solves the prior-art problem that a single discrimination has low accuracy;
3) the precise filtering of the corpus solves the prior-art problem that the whole big data text corpus has to be traversed;
4) by relying on the precisely filtered training corpus, a text similarity discrimination model combining concrete and abstract semantics, and the secondary discrimination of a linear model, the invention discriminates the similarity of public opinion corpora accurately and efficiently: it improves the accuracy of text similarity discrimination, saves the big data computing resources needed to train the similarity discrimination model, and shortens the time spent manually labeling the big data public opinion training corpus.
Description of the drawings:
FIG. 1 is a flow chart of the dual semantic similarity judging method for grid society situation research and judgment according to the invention.
FIG. 2 is a flow chart of the training corpus building method of the invention.
Detailed description of the embodiments:
The present invention is described in detail below with reference to an example.
Example 1: referring to FIG. 1 and FIG. 2, a dual semantic similarity judging method for grid society situation research and judgment includes the following steps:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result into HBase.
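Step 9 stores the discrimination result in HBase, but the table layout is not described. The sketch below shows one way a result could be shaped into an HBase-style row, that is, a byte-string row key plus a mapping from `family:qualifier` to byte values. The function name, the column family `sim`, the qualifiers, and the hashed row key are all hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def build_hbase_row(
    text_a: str, text_b: str, x1: float, x2: float, label: int
) -> Tuple[bytes, Dict[bytes, bytes]]:
    """Shape one discrimination result as (row_key, columns).

    The row key is a SHA-1 digest of the two texts so that the same pair
    always maps to the same row; the layout is an illustrative assumption.
    """
    key_source = (text_a + "\x00" + text_b).encode("utf-8")
    row_key = hashlib.sha1(key_source).hexdigest().encode("ascii")
    columns = {
        b"sim:euclidean": repr(x1).encode("ascii"),
        b"sim:cosine": repr(x2).encode("ascii"),
        b"sim:label": str(label).encode("ascii"),  # 0 = similar, 1 = dissimilar
    }
    return row_key, columns
```

With a Python client such as happybase, the returned pair could be passed to `Table.put(row_key, columns)`; that call is mentioned only as one option and is not part of the source.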
Wherein, the step 1) is specifically as follows:
step 1-1, acquiring text big data from multi-source webpages such as microblogs, major news websites, key forums and the like based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation results with manually set thresholds, and filtering out the corpus pairs in the public opinion corpus whose concrete-semantic and abstract-semantic similarity results clearly fall outside the threshold range;
in the step 1-4, the thresholds are set according to the results of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, manually labeling the similarity of the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus.
In the step 3), the similarity calculation model performs Euclidean distance calculation and cosine angle calculation on the input feature vector a and the feature vector b, respectively.
The step 4) linearly combines the intermediate discrimination results, specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine angle calculation result X2, and the result of the further calculation is expressed as a Boolean value, where 0 means similar and 1 means dissimilar.
In the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.
In the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through the Sigmoid function, generating the final similarity discrimination model.
The values of c and d in the similarity discrimination model of the step 8) are determined by the Sigmoid function; the value of c is 0.9 and the value of d is 0.1.
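As a worked illustration of the secondary calculation, the quantities above can be written out as follows. The weights c = 0.9 and d = 0.1 are the values given in the text; the inputs X1 = 0.2 and X2 = 0.8 are made-up numbers, and the way a Boolean value is derived from the Sigmoid output is not specified in the source.

```latex
\begin{aligned}
X_1 &= \lVert a - b \rVert_2, \qquad
X_2 = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}, \\
z &= c X_1 + d X_2, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \\
z &= 0.9 \times 0.2 + 0.1 \times 0.8 = 0.26, \qquad
\sigma(0.26) \approx 0.565 .
\end{aligned}
```

With these weights the Euclidean term dominates the score: a change in X1 moves z nine times as much as the same change in X2.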
It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.
Claims (2)
1. The dual semantic similarity judging method for the grid society situation research and judgment is characterized by comprising the following steps of:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result into HBase;
the step 1) is specifically as follows:
step 1-1, acquiring text big data from microblogs, major news websites and key forum multi-source webpages based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation results with manually set thresholds, and filtering out the corpus pairs in the public opinion corpus whose concrete-semantic and abstract-semantic similarity results clearly fall outside the threshold range;
in the step 1-4, the thresholds are set according to the results of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, manually labeling the similarity of the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus;
in the step 3), the similarity calculation model performs Euclidean distance calculation and cosine angle calculation on the input feature vector a and the feature vector b respectively, computing the cosine of the angle between the vectors and the normalized value of the Euclidean distance between the vectors, and takes these similarity values as the intermediate discrimination result;
the step 4) linearly combines the intermediate discrimination results, specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine angle calculation result X2, and the result of the further calculation is expressed as a Boolean value, where 0 means similar and 1 means dissimilar;
in the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training;
in the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through the Sigmoid function, generating the final similarity discrimination model.
2. The dual semantic similarity judging method for grid society situation research and judgment according to claim 1, wherein the values of c and d in the similarity discrimination model of the step 8) are determined by the Sigmoid function, the value of c being 0.9 and the value of d being 0.1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911144452.6A CN111723297B (en) | 2019-11-20 | 2019-11-20 | Dual-semantic similarity judging method for grid society situation research and judgment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911144452.6A CN111723297B (en) | 2019-11-20 | 2019-11-20 | Dual-semantic similarity judging method for grid society situation research and judgment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723297A (en) | 2020-09-29 |
CN111723297B (en) | 2023-05-12 |
Family
ID=72563929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911144452.6A Active CN111723297B (en) | 2019-11-20 | 2019-11-20 | Dual-semantic similarity judging method for grid society situation research and judgment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723297B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076734B (en) * | 2021-04-15 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344236A (en) * | 2018-09-07 | 2019-02-15 | 暨南大学 | One kind being based on the problem of various features similarity calculating method |
CN109508379A (en) * | 2018-12-21 | 2019-03-22 | 上海文军信息技术有限公司 | A kind of short text clustering method indicating and combine similarity based on weighted words vector |
CN110008323A (en) * | 2019-03-27 | 2019-07-12 | 北京百分点信息科技有限公司 | A kind of the problem of semi-supervised learning combination integrated study, equivalence sentenced method for distinguishing |
CN110032632A (en) * | 2019-04-04 | 2019-07-19 | 平安科技(深圳)有限公司 | Intelligent customer service answering method, device and storage medium based on text similarity |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI753034B (en) * | 2017-03-31 | 2022-01-21 | 香港商阿里巴巴集團服務有限公司 | Method, device and electronic device for generating and searching feature vector |
Non-Patent Citations (1)
Title |
---|
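Semantic similarity measurement based on a low-dimensional semantic vector model; Cai Yuanyuan et al.; Journal of University of Science and Technology of China; Vol. 46, No. 09; pp. 719-726 * |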
Also Published As
Publication number | Publication date |
---|---|
CN111723297A (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110727880B (en) | Sensitive corpus detection method based on word bank and word vector model | |
CN105426539A (en) | Dictionary-based lucene Chinese word segmentation method | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN104008166A (en) | Dialogue short text clustering method based on form and semantic similarity | |
CN107102993B (en) | User appeal analysis method and device | |
CN104598535A (en) | Event extraction method based on maximum entropy | |
CN103324700A (en) | Noumenon concept attribute learning method based on Web information | |
CN111274814B (en) | Novel semi-supervised text entity information extraction method | |
CN105512347A (en) | Information processing method based on geographic topic model | |
CN105320646A (en) | Incremental clustering based news topic mining method and apparatus thereof | |
CN106682123A (en) | Hot event acquiring method and device | |
CN105975475A (en) | Chinese phrase string-based fine-grained thematic information extraction method | |
CN110188359B (en) | Text entity extraction method | |
CN104504024A (en) | Method and system for mining keywords based on microblog content | |
CN103246644A (en) | Method and device for processing Internet public opinion information | |
CN106681716A (en) | Intelligent terminal and automatic classification method of application programs thereof | |
CN105183742A (en) | Resume identification method | |
CN105426379A (en) | Keyword weight calculation method based on position of word | |
CN111723297B (en) | Dual-semantic similarity judging method for grid society situation research and judgment | |
CN104331443A (en) | Industry data source detection method | |
CN105528341A (en) | Term translation mining system and method with field customization function | |
CN106202033B (en) | A kind of adverbial word Word sense disambiguation method and device based on interdependent constraint and knowledge | |
CN110738987B (en) | Keyword retrieval method based on unified representation | |
CN107491440B (en) | Natural language word segmentation construction method and system and natural language classification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||