CN111723297B - Dual-semantic similarity judging method for grid society situation research and judgment

Publication number
CN111723297B
Authority
CN
China
Prior art keywords
feature vector
similarity
corpus
model
discrimination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911144452.6A
Other languages
Chinese (zh)
Other versions
CN111723297A (en)
Inventor
钱华
姜永华
钱建华
王巧荣
房查
张宏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Fablesoft Co ltd
Political And Legal Committee Of Nantong Municipal Committee Of Communist Party Of China
Original Assignee
Jiangsu Fablesoft Co ltd
Political And Legal Committee Of Nantong Municipal Committee Of Communist Party Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-20
Filing date
2019-11-20
Publication date
2023-05-12
Application filed by Jiangsu Fablesoft Co ltd, Political And Legal Committee Of Nantong Municipal Committee Of Communist Party Of China
2019-11-20 Priority to CN201911144452.6A
2020-09-29 Publication of CN111723297A
2023-05-12 Application granted
2023-05-12 Publication of CN111723297B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a dual semantic similarity judging method for grid society situation research and judgment, which comprises the following steps: step 1) obtaining a training corpus; step 2) inputting the training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model; step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics to generate an intermediate discrimination result; step 4) linearly combining the intermediate discrimination results; step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2; step 6) performing parameter tuning on the linear discrimination model by using a Sigmoid function according to the secondary calculation result, and generating a final similarity discrimination model; step 7) extracting, by using the BERT model, a feature vector a and a feature vector b of text big data to be discriminated that is newly collected from multi-source webpages; step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and text feature vector b; step 9) storing the similarity discrimination result in HBase.

Description

Dual-semantic similarity judging method for grid society situation research and judgment
Technical Field
The invention relates to a similarity judging method, in particular to a dual semantic similarity judging method for grid society situation research and judgment, and belongs to the technical field of big data public opinion analysis.
Background
Public opinion analysis is an important means of grid society situation research and judgment in the grid-based social management work of government commissions. It often involves labeling and similarity analysis of text data from web pages. Because similarity judgment is limited by text feature extraction technology, its results have long been unsatisfactory, and similarity judging methods have been continually optimized as feature extraction technology has advanced.
In existing similarity discrimination technology, a corpus is usually labeled manually to form training samples; the similarity is then calculated from those samples using the cosine of the included angle or the Euclidean distance between vectors, and a similarity discrimination model is trained; finally, the trained model is used to judge the similarity of new text.
As this process shows, manual labeling of public opinion big data is very costly, and a single semantic similarity calculation model often fails to produce accurate similarity discrimination results. A new scheme is therefore urgently needed to solve these technical problems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a dual semantic similarity judging method for grid society situation research and judgment. The scheme analyzes both concrete semantics and abstract semantics, which solves the problem that single semantic analysis in the prior art is poorly suited to public opinion data.
To achieve the above purpose, the technical scheme of the invention is a dual semantic similarity judging method for grid society situation research and judgment, comprising the following steps:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result in HBase.
As an improvement of the present invention, the step 1) specifically comprises the following steps:
step 1-1, acquiring text big data from multi-source webpages such as microblogs, major news websites, key forums and the like based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation result with a manually set threshold value, and filtering out of the public opinion corpus the corpus pairs whose concrete-semantic and abstract-semantic similarity calculation results clearly fall outside the threshold range;
in the step 1-4, the threshold value is set according to the result of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, performing manual similarity labeling on the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus.
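The filtering in step 1-4 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `filter_scored_pairs`, the tuple layout, and the numeric ranges are all assumptions, since the text only says that the thresholds are set manually from similarity results on part of the corpus.

```python
def filter_scored_pairs(scored_pairs, dist_range, cos_range):
    """Keep only pairs whose two similarity results both fall in range.

    scored_pairs: list of (pair_id, euclidean_distance, cosine) tuples.
    dist_range, cos_range: inclusive (low, high) bounds; hypothetical
    stand-ins for the manually set thresholds of step 1-4.
    """
    kept = []
    for pair_id, dist, cos in scored_pairs:
        in_dist = dist_range[0] <= dist <= dist_range[1]
        in_cos = cos_range[0] <= cos <= cos_range[1]
        if in_dist and in_cos:
            kept.append(pair_id)
    return kept


# Made-up scores for three corpus pairs; p2 is clearly out of range.
scored = [("p1", 0.4, 0.92), ("p2", 7.5, 0.05), ("p3", 1.1, 0.71)]
candidates = filter_scored_pairs(scored, dist_range=(0.0, 2.0), cos_range=(0.5, 1.0))
print(candidates)
```

Only the pairs that survive this filter would then go on to manual labeling in step 1-5, which is where the saving in labeling effort would come from.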
As an improvement of the present invention, the similarity calculation model in the step 3) computes both the Euclidean distance and the cosine of the included angle between the input feature vector a and feature vector b.
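The two measures named here can be sketched in plain Python as below. The claims mention a normalized value of the Euclidean distance without giving the formula, so the mapping 1/(1+distance) used in `normalized_distance_similarity` is an assumption, as are all function names.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_of_angle(a, b):
    """Cosine of the included angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def normalized_distance_similarity(a, b):
    """Assumed normalization: maps a distance in [0, inf) onto (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


a = [0.0, 0.0]
b = [3.0, 4.0]
print(euclidean_distance(a, b), normalized_distance_similarity(a, b))
```

The distance is sensitive to vector magnitude while the cosine depends only on direction, which is presumably why the method keeps both rather than choosing one.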
As an improvement of the present invention, the step 4 linearly combines the intermediate discrimination results, specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine included angle calculation result X2, and the further calculation result is expressed as a Boolean value, where 0 means similar and 1 means dissimilar.
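A minimal sketch of this combination and its Boolean output follows. The decision threshold of 0.5 and the assumption that X1 and X2 are both scaled so that larger means more similar are not stated in the text; the weights passed in below are placeholders.

```python
def discriminate(x1, x2, c, d, threshold=0.5):
    """Combine the two results linearly and map the score to the
    Boolean convention of the text: 0 = similar, 1 = dissimilar.

    Assumes x1 and x2 are scaled so that larger means more similar;
    the threshold value is hypothetical.
    """
    score = c * x1 + d * x2
    label = 0 if score >= threshold else 1
    return score, label


# Placeholder weights, used only to show the mechanics.
print(discriminate(0.9, 0.95, c=0.5, d=0.5))
print(discriminate(0.1, 0.2, c=0.5, d=0.5))
```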
As an improvement of the present invention, in the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.
As an improvement of the present invention, in the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through a Sigmoid function, generating the final similarity discrimination model.
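The text does not spell out how the Sigmoid function is used for tuning. One plausible reading is logistic-regression-style gradient descent on the labeled pairs, sketched below under that assumption. The intercept `bias`, the learning rate, the epoch count, and the toy samples are all additions for illustration and do not come from the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(samples, c, d, bias):
    """Mean cross-entropy; labels follow the text (0 = similar)."""
    total = 0.0
    for x1, x2, label in samples:
        target = 1 - label  # flip so that 1.0 means "similar"
        p = sigmoid(c * x1 + d * x2 + bias)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, lr=0.5, epochs=2000):
    """Batch gradient descent on sigmoid(c*X1 + d*X2 + bias)."""
    c = d = bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_c = grad_d = grad_b = 0.0
        for x1, x2, label in samples:
            err = sigmoid(c * x1 + d * x2 + bias) - (1 - label)
            grad_c += err * x1
            grad_d += err * x2
            grad_b += err
        c -= lr * grad_c / n
        d -= lr * grad_d / n
        bias -= lr * grad_b / n
    return c, d, bias


# Made-up (X1, X2, label) samples: high scores similar, low dissimilar.
samples = [
    (0.90, 0.95, 0), (0.80, 0.90, 0), (0.85, 0.80, 0),
    (0.20, 0.10, 1), (0.30, 0.20, 1), (0.10, 0.30, 1),
]
c, d, bias = train(samples)
print(c, d, bias, log_loss(samples, c, d, bias))
```

The intercept is included because, with non-negative inputs and no intercept, the sigmoid of cX1+dX2 could not separate the two classes around 0.5. Whether the patented training uses one is not stated.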
As an improvement of the present invention, the values of c and d in the similarity discrimination model in the step 8) have been determined by the Sigmoid function, wherein the value of c is 0.9 and the value of d is 0.1.
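As a worked illustration of these weights, suppose a corpus pair yields X1 = 0.80 and X2 = 0.60. These two inputs are hypothetical values chosen only to show the arithmetic.

```latex
\[
cX_1 + dX_2 = 0.9 \times 0.80 + 0.1 \times 0.60 = 0.72 + 0.06 = 0.78
\]
```

With c = 0.9 and d = 0.1, the Euclidean distance result X1 contributes nine times the weight of the cosine result X2, so the combined score tracks X1 closely.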
Compared with the prior art, the invention has the following technical effects: 1) the dual analysis of concrete semantics and abstract semantics solves the problem that single semantic analysis in the prior art is poorly suited to public opinion data; 2) the secondary judgment of the dual semantic similarity result solves the problem of low accuracy of a single judgment in the prior art; 3) the corpus is accurately filtered, which solves the problem that the entire big data text corpus must be traversed in the prior art; 4) by relying on the precisely filtered training corpus, a text similarity discrimination model combining concrete and abstract semantics, and the secondary discrimination of a linear model, the invention achieves accurate and efficient discrimination of public opinion corpus similarity, which improves the accuracy of corpus text similarity discrimination, saves the big data computing resources needed to train the similarity discrimination model, and shortens the time spent manually labeling the big data public opinion training corpus.
Description of the Drawings
FIG. 1 is a flow chart of the dual semantic similarity judging method for grid society situation research and judgment according to the invention;
FIG. 2 is a flow chart of the training corpus building method of the invention.
Detailed Description
The invention is described in detail below with reference to an example.
Example 1: referring to FIG. 1 and FIG. 2, a dual semantic similarity judging method for grid society situation research and judgment includes the following steps:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result in HBase.
Wherein, the step 1) is specifically as follows:
step 1-1, acquiring text big data from multi-source webpages such as microblogs, major news websites, key forums and the like based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation result with a manually set threshold value, and filtering out of the public opinion corpus the corpus pairs whose concrete-semantic and abstract-semantic similarity calculation results clearly fall outside the threshold range;
in the step 1-4, the threshold value is set according to the result of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, performing manual similarity labeling on the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus.
In the step 3), the similarity calculation model performs Euclidean distance calculation and cosine included angle calculation on the input feature vector a and feature vector b respectively.
The step 4) of linearly combining the intermediate discrimination results is specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine included angle calculation result X2, and the further calculation result is expressed as a Boolean value, where 0 means similar and 1 means dissimilar.
In the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.
In the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through a Sigmoid function, generating the final similarity discrimination model.
The values of c and d in the similarity discrimination model in the step 8) have been determined by the Sigmoid function, wherein the value of c is 0.9 and the value of d is 0.1.
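Steps 7) to 9) of this example can be sketched end to end as below, under heavy simplification. `toy_embed` is a hypothetical character-count stand-in for the BERT encoder, a plain dict stands in for the HBase table, and the distance normalization and the 0.5 threshold are assumptions; only the weights 0.9 and 0.1 come from the text.

```python
import math

C, D = 0.9, 0.1  # weights stated in the text


def toy_embed(text, dim=16):
    """Hypothetical stand-in for BERT: a character-count vector."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def similarity_score(a, b):
    """cX1 + dX2 with an assumed 1/(1+distance) normalization for X1."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    x1 = 1.0 / (1.0 + dist)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    dot = sum(x * y for x, y in zip(a, b))
    x2 = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return C * x1 + D * x2


def judge_and_store(store, row_key, text_a, text_b, threshold=0.5):
    """Score one text pair and save the result (dict replaces HBase)."""
    score = similarity_score(toy_embed(text_a), toy_embed(text_b))
    record = {"score": score, "label": 0 if score >= threshold else 1}
    store[row_key] = record
    return record


results = {}
judge_and_store(results, "pair-0001", "road repair notice", "road repair notice")
judge_and_store(results, "pair-0002", "road repair notice", "z" * 24)
print(results)
```

In a real deployment the dict write would be replaced by a put against an HBase table through a client library, and `toy_embed` by sentence vectors from a BERT model. Both are outside the scope of this sketch.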
It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims (2)

1. A dual semantic similarity judging method for grid society situation research and judgment, characterized by comprising the following steps:
step 1) obtaining a training corpus;
step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;
step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;
step 4) linearly combining the intermediate discrimination results;
step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;
step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;
step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;
step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;
step 9) storing the similarity discrimination result in HBase;
the step 1) is specifically as follows:
step 1-1, acquiring text big data from multi-source webpages, including microblogs, major news websites and key forums, based on preset business keywords to form an initial public opinion corpus;
step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;
step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;
the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;
step 1-4, comparing the similarity calculation result with a manually set threshold value, and filtering out of the public opinion corpus the corpus pairs whose concrete-semantic and abstract-semantic similarity calculation results clearly fall outside the threshold range;
in the step 1-4, the threshold value is set according to the result of similarity calculation on part of the corpus in the initial public opinion corpus;
step 1-5, performing manual similarity labeling on the filtered corpus pairs;
the corpus texts manually labeled for similarity in the step 1-5 form the training corpus;
in the step 3), the similarity calculation model performs Euclidean distance calculation and cosine included angle calculation on the input feature vector a and feature vector b respectively, obtaining the cosine value of the included angle between the vectors and the normalized value of the Euclidean distance between the vectors, and takes these similarity values as the intermediate discrimination result;
the step 4) of linearly combining the intermediate discrimination results is specifically as follows:
the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine included angle calculation result X2, and the further calculation result is expressed as a Boolean value, where 0 means similar and 1 means dissimilar;
in the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training;
in the step 6), the secondary calculation result is used to tune the parameters of the linear discrimination model through a Sigmoid function, generating the final similarity discrimination model.
2. The dual semantic similarity judging method for grid society situation research and judgment according to claim 1, wherein the values of c and d in the similarity discrimination model in the step 8) have been determined by the Sigmoid function, the value of c being 0.9 and the value of d being 0.1.
CN201911144452.6A 2019-11-20 2019-11-20 Dual-semantic similarity judging method for grid society situation research and judgment Active CN111723297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911144452.6A CN111723297B (en) 2019-11-20 2019-11-20 Dual-semantic similarity judging method for grid society situation research and judgment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911144452.6A CN111723297B (en) 2019-11-20 2019-11-20 Dual-semantic similarity judging method for grid society situation research and judgment

Publications (2)

Publication Number Publication Date
CN111723297A (en) 2020-09-29
CN111723297B (en) 2023-05-12

Family

ID=72563929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911144452.6A Active CN111723297B (en) 2019-11-20 2019-11-20 Dual-semantic similarity judging method for grid society situation research and judgment

Country Status (1)

Country Link
CN (1) CN111723297B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 A question similarity calculation method based on multiple features
CN109508379A (en) * 2018-12-21 2019-03-22 上海文军信息技术有限公司 A short text clustering method based on weighted word vector representation and combined similarity
CN110008323A (en) * 2019-03-27 2019-07-12 北京百分点信息科技有限公司 A question equivalence discrimination method combining semi-supervised learning with ensemble learning
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
TWI753034B (en) * 2017-03-31 2022-01-21 香港商阿里巴巴集團服務有限公司 Method, device and electronic device for generating and searching feature vector

Non-Patent Citations (1)

Title
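Semantic similarity measurement based on a low-dimensional semantic vector model; 蔡圆媛 et al.; Journal of University of Science and Technology of China (中国科学技术大学学报); Vol. 46, No. 09; pp. 719-726 *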

Also Published As

Publication number Publication date
CN111723297A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN110727880B (en) Sensitive corpus detection method based on word bank and word vector model
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN104008166A (en) Dialogue short text clustering method based on form and semantic similarity
CN107102993B (en) User appeal analysis method and device
CN104598535A (en) Event extraction method based on maximum entropy
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN111274814B (en) Novel semi-supervised text entity information extraction method
CN105512347A (en) Information processing method based on geographic topic model
CN105320646A (en) Incremental clustering based news topic mining method and apparatus thereof
CN106682123A (en) Hot event acquiring method and device
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN110188359B (en) Text entity extraction method
CN104504024A (en) Method and system for mining keywords based on microblog content
CN103246644A (en) Method and device for processing Internet public opinion information
CN106681716A (en) Intelligent terminal and automatic classification method of application programs thereof
CN105183742A (en) Resume identification method
CN105426379A (en) Keyword weight calculation method based on position of word
CN111723297B (en) Dual-semantic similarity judging method for grid society situation research and judgment
CN104331443A (en) Industry data source detection method
CN105528341A (en) Term translation mining system and method with field customization function
CN106202033B (en) A kind of adverbial word Word sense disambiguation method and device based on interdependent constraint and knowledge
CN110738987B (en) Keyword retrieval method based on unified representation
CN107491440B (en) Natural language word segmentation construction method and system and natural language classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant