CN111723297B - Dual-semantic similarity judging method for grid society situation research and judgment

Info

Publication number: CN111723297B
Application number: CN201911144452.6A
Authority: CN
Inventors: 钱华; 姜永华; 钱建华; 王巧荣; 房查; 张宏斌
Original assignee: Jiangsu Fablesoft Co ltd; Political And Legal Committee Of Nantong Municipal Committee Of Communist Party Of China
Current assignee: Jiangsu Fablesoft Co ltd; Political And Legal Committee Of Nantong Municipal Committee Of Communist Party Of China
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2023-05-12
Anticipated expiration: 2039-11-20
Also published as: CN111723297A

Abstract

The invention relates to a dual semantic similarity judging method for grid society situation research and judgment, comprising the following steps: step 1) obtaining a training corpus; step 2) inputting the training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model; step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics to generate an intermediate discrimination result; step 4) linearly combining the intermediate discrimination results; step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2; step 6) performing parameter tuning on the linear discrimination model through a Sigmoid function according to the secondary calculation result, and generating a final similarity discrimination model; step 7) extracting, by using the BERT model, a feature vector a and a feature vector b of the text big data to be discriminated, which are newly collected from multi-source webpages; step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and text feature vector b; step 9) storing the similarity discrimination result into HBASE.

Description

Dual-semantic similarity judging method for grid society situation research and judgment

Technical Field

The invention relates to a similarity judging method, in particular to a dual semantic similarity judging method for grid society situation research and judgment, and belongs to the technical field of big data public opinion analysis.

Background

Public opinion analysis is an important means of grid society situation research and judgment in the grid-based social management work of government commissions. It often involves labeling and similarity analysis of text data from web pages. Limited by text feature extraction technology, similarity discrimination results have long been unsatisfactory, and similarity discrimination methods have been optimized continually as feature extraction technology has advanced.

In the existing similarity discrimination technology, a corpus is usually labeled manually to form training samples; then, based on the training samples, the similarity is calculated from the cosine of the included angle between vectors or from the Euclidean distance between vectors, and a similarity discrimination model is trained; finally, the trained similarity discrimination model is used to discriminate the similarity of new text.

It can be seen from the above process that, for public opinion big data, manual labeling is very costly, and a single semantic similarity calculation model often fails to produce an accurate similarity discrimination result. A new scheme is therefore urgently needed to solve these technical problems.

Disclosure of Invention

To address the problems in the prior art, the invention provides a dual semantic similarity judging method for grid society situation research and judgment. The scheme analyzes both concrete semantics and abstract semantics, and thereby solves the problem that single semantic analysis in the prior art is poorly suited to public opinion data.

In order to achieve the above purpose, the technical scheme of the invention is as follows: a dual semantic similarity judging method for grid society situation research and judgment, characterized by comprising the following steps:

step 1) obtaining a training corpus;

step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;

step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;

step 4) linearly combining the intermediate discrimination results;

step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;

step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;

step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;

step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;

step 9) storing the similarity discrimination result into HBASE.

As an improvement of the present invention, the step 1) specifically comprises the following steps:

step 1-1, acquiring text big data from multi-source webpages such as microblogs, major news websites, key forums and the like based on preset business keywords to form an initial public opinion corpus;

step 1-2, inputting an initial public opinion corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the initial public opinion corpus by using a BERT model;

step 1-3, performing similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics;

the similarity calculation model in the step 1-3 carries out Euclidean distance calculation and cosine included angle calculation on the input feature vector a and the feature vector b respectively;

step 1-4, comparing the similarity calculation results with manually set thresholds, and filtering out the corpus pairs in the public opinion corpus whose concrete-semantic and abstract-semantic similarity calculation results clearly fall outside the threshold range;

in the step 1-4, the threshold value is set depending on the result of similarity calculation on partial corpora in the initial public opinion corpus;

step 1-5, performing manual similarity labeling on the filtered corpus pairs;

the corpus texts manually labeled for similarity in the step 1-5 form the training corpus.
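The filtering in steps 1-3 and 1-4 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the tuple layout `(pair_id, distance, cosine)`, and the example ranges are hypothetical, and the ranges are still chosen by a person after inspecting a summary of a sample of the corpus.

```python
import statistics

def summarize_sample(values):
    """Summarize scores from a partial corpus so that an analyst can pick a range.

    Returns (minimum, median, maximum). The patent only states that the
    threshold depends on results computed on part of the corpus; this
    particular summary is an assumption.
    """
    return min(values), statistics.median(values), max(values)

def filter_pairs(scored_pairs, dist_range, cos_range):
    """Keep the ids of pairs whose two scores both lie inside the given ranges.

    scored_pairs: iterable of (pair_id, euclidean_distance, cosine_value).
    dist_range, cos_range: (low, high) tuples set manually.
    """
    kept = []
    for pair_id, distance, cosine in scored_pairs:
        in_dist = dist_range[0] <= distance <= dist_range[1]
        in_cos = cos_range[0] <= cosine <= cos_range[1]
        if in_dist and in_cos:
            kept.append(pair_id)
    return kept

# Hypothetical scores for three corpus pairs.
scored = [("p1", 0.2, 0.95), ("p2", 5.0, 0.10), ("p3", 0.8, 0.70)]
print(summarize_sample([d for _, d, _ in scored]))
print(filter_pairs(scored, dist_range=(0.0, 1.0), cos_range=(0.5, 1.0)))
```

Only the pairs that survive this filter are passed on to manual labeling, which is how the method limits the labeling effort.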

As an improvement of the present invention, the similarity calculation model in the step 3 performs Euclidean distance calculation and cosine included angle calculation on the input feature vector a and feature vector b, respectively.
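A minimal sketch of these two calculations on plain Python lists is given below. The function names are illustrative. The `1 / (1 + distance)` normalization is an assumption: the claims mention a normalized value of the Euclidean distance but do not give the formula.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_value(a, b):
    """Cosine of the included angle between two vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def intermediate_results(a, b):
    """Return (X1, X2) for one pair of feature vectors.

    X1: Euclidean distance mapped into (0, 1] by 1 / (1 + distance).
        This mapping is an assumed normalization, not taken from the patent.
    X2: cosine of the included angle.
    """
    x1 = 1.0 / (1.0 + euclidean_distance(a, b))
    x2 = cosine_value(a, b)
    return x1, x2

# Toy vectors standing in for BERT feature vectors.
print(intermediate_results([1.0, 0.0], [0.0, 1.0]))
```

With this normalization both X1 and X2 grow as the two texts become more alike, which keeps the later linear combination easy to interpret.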

As an improvement of the present invention, the step 4 linearly combines the intermediate discrimination results, specifically as follows:

the intermediate discrimination result in the step 4 is a linear combination of the Euclidean distance calculation result X1 and the cosine included angle calculation result X2, and the result of the further calculation is expressed as a Boolean value, wherein 0 means similar and 1 means dissimilar.
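The combination and the Boolean output might be sketched as below. The decision threshold of 0.5 and the assumption that X1 and X2 are both similarity-oriented scores in the range 0 to 1 are not stated in the text; they are placeholders chosen for illustration, as are the function names.

```python
def linear_score(x1, x2, c, d):
    """Linear combination c*X1 + d*X2 of the two intermediate results."""
    return c * x1 + d * x2

def discriminate(x1, x2, c, d, threshold=0.5):
    """Return 0 for similar and 1 for dissimilar, following the stated convention.

    Assumes larger scores mean greater similarity; the threshold is hypothetical.
    """
    return 0 if linear_score(x1, x2, c, d) >= threshold else 1

# Hypothetical intermediate results with equal weights.
print(discriminate(0.9, 0.8, c=0.5, d=0.5))
print(discriminate(0.1, 0.2, c=0.5, d=0.5))
```

Note the inverted convention: a return value of 0, not 1, marks a similar pair.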

As an improvement of the present invention, in the step 5, the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.

As an improvement of the present invention, in the step 6, the secondary calculation result is used to perform parameter tuning on the linear discrimination model through the Sigmoid function, and the final similarity discrimination model is generated.
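The text does not spell out the tuning procedure. One plausible reading, offered here only as an assumption, is logistic regression: the linear score is passed through a Sigmoid function and c and d are adjusted by gradient descent on the manually labeled pairs. The bias term, learning rate, epoch count, and toy data below are all hypothetical additions that the patent does not mention.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(samples, labels, lr=0.5, epochs=3000):
    """Fit c, d and a bias by batch gradient descent on the logistic loss.

    samples: list of (X1, X2) intermediate results.
    labels:  list of 0 (similar) or 1 (dissimilar), as in the patent.
    Internally the target is 1 for similar so that larger scores mean similar.
    """
    c = d = bias = 0.0
    n = len(samples)
    targets = [1 - y for y in labels]
    for _ in range(epochs):
        grad_c = grad_d = grad_b = 0.0
        for (x1, x2), t in zip(samples, targets):
            err = sigmoid(c * x1 + d * x2 + bias) - t
            grad_c += err * x1
            grad_d += err * x2
            grad_b += err
        c -= lr * grad_c / n
        d -= lr * grad_d / n
        bias -= lr * grad_b / n
    return c, d, bias

def predict(x1, x2, c, d, bias):
    """Return 0 for similar and 1 for dissimilar."""
    return 0 if sigmoid(c * x1 + d * x2 + bias) >= 0.5 else 1

# Toy labeled pairs: three similar (0) and three dissimilar (1).
samples = [(0.9, 0.95), (0.8, 0.9), (0.85, 0.8), (0.2, 0.1), (0.3, 0.2), (0.1, 0.3)]
labels = [0, 0, 0, 1, 1, 1]
c, d, bias = fit_weights(samples, labels)
print(c, d, bias)
```

Weights fitted this way are not constrained to sum to 1, so they would need rescaling before they could be compared with a pair of values such as 0.9 and 0.1.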

As an improvement of the present invention, the values of c and d in the similarity discrimination model in the step 8) have been determined by the Sigmoid function, wherein the value of c is 0.9 and the value of d is 0.1.
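As a worked illustration of these weights, take hypothetical intermediate results X1 = 0.80 and X2 = 0.60 (values chosen only for the arithmetic, not taken from the patent):

```latex
\[
  cX_1 + dX_2 = 0.9 \times 0.80 + 0.1 \times 0.60 = 0.72 + 0.06 = 0.78
\]
```

With c = 0.9 and d = 0.1, the Euclidean distance result X1 contributes nine times the weight of the cosine result X2.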

Compared with the prior art, the invention has the following technical effects: 1) the dual analysis of concrete semantics and abstract semantics solves the problem that single semantic analysis in the prior art is poorly suited to public opinion data; 2) the secondary discrimination applied to the dual semantic similarity result solves the problem that a single, primary discrimination in the prior art has low accuracy; 3) the precise filtering of the corpus solves the problem that the prior art must traverse the whole big data text corpus; 4) by relying on the precisely filtered training corpus, a text similarity discrimination model that combines concrete semantics and abstract semantics, and the secondary discrimination of a linear model, the invention can discriminate the similarity of public opinion corpora accurately and efficiently, which improves the accuracy of text similarity discrimination, saves the big data computing resources needed to train the similarity discrimination model, and shortens the time spent manually labeling the big data public opinion training corpus.

Description of the drawings:

FIG. 1 is a flow chart of the dual semantic similarity judging method for grid society situation research and judgment according to the invention.

FIG. 2 is a flow chart of the training corpus building method of the present invention.

The specific embodiment is as follows:

the present invention will be described in detail with reference to examples.

Example 1: referring to FIG. 1 and FIG. 2, a dual semantic similarity judging method for grid society situation research and judgment includes the following steps:

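step 1) obtaining a training corpus;

step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;

step 3) performing preliminary similarity calculation on the feature vector a and the feature vector b through a similarity calculation model combining abstract semantics and concrete semantics, and generating an intermediate discrimination result;

step 4) linearly combining the intermediate discrimination results;

step 5) performing secondary calculation on the intermediate discrimination result through a linear discrimination model cX1+dX2;

step 6) utilizing the secondary calculation result, performing parameter tuning on the linear discrimination model through a Sigmoid function, and generating a final similarity discrimination model;

step 7) extracting a feature vector a and a feature vector b of the text big data to be distinguished, which are newly collected from the multi-source webpage, by utilizing the BERT model;

step 8) executing the trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;

step 9) storing the similarity discrimination result into HBASE.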

Wherein, the step 1) is specifically as follows:

In the step 3), the similarity calculation model performs Euclidean distance calculation and cosine included angle calculation on the input feature vector a and feature vector b, respectively.

In the step 4), the intermediate discrimination results are linearly combined as follows:

the intermediate discrimination result in the step 4) is a linear combination of the Euclidean distance calculation result X1 and the cosine included angle calculation result X2, and the result of the further calculation is expressed as a Boolean value, wherein 0 means similar and 1 means dissimilar.

In the step 5), the values of c and d in cX1+dX2 are finally determined by Sigmoid function training.

In the step 6), the secondary calculation result is used to perform parameter tuning on the linear discrimination model through the Sigmoid function, and the final similarity discrimination model is generated.

The values of c and d in the similarity discrimination model in the step 8) are determined by the Sigmoid function, wherein the value of c is 0.9 and the value of d is 0.1.

It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims

1. The dual semantic similarity judging method for the grid society situation research and judgment is characterized by comprising the following steps of:

step 1) obtaining a training corpus;

step 2) inputting a training corpus, and extracting a feature vector a and a feature vector b of corpus pairs in the training corpus by using a BERT model;

step 4) linearly combining the intermediate discrimination results;

step 8) executing a trained similarity discrimination model cX1+dX2 on the input text feature vector a and the text feature vector b;

step 9) storing the similarity discrimination result into HBASE;

the step 1) is specifically as follows:

step 1-1, acquiring text big data from microblogs, major news websites and key forum multi-source webpages based on preset business keywords to form an initial public opinion corpus;

the corpus texts after the manual similarity labeling in the steps 1-5 form a training corpus,

in the step 3, the similarity calculation model performs Euclidean distance calculation and cosine included angle calculation on the input feature vector a and feature vector b respectively, obtaining the cosine value of the vector included angle and the normalized value of the vector Euclidean distance, and takes these similarity values as the intermediate discrimination result;

in the step 4, the intermediate discrimination results are linearly combined, specifically as follows:

the intermediate discrimination result in the step 4 is a linear combination of a Euclidean distance calculation result X1 and a cosine included angle calculation result X2, further calculation results are expressed as Boolean values, wherein 0 is similar, 1 is dissimilar,

in the step 5, the values of c and d in cX1+dX2 are finally determined by training the Sigmoid function,

and in the step 6, the secondary calculation result is utilized, the parameter tuning is carried out on the linear discrimination model through a Sigmoid function, and a final similarity discrimination model is generated.

2. The dual semantic similarity judging method for grid society situation research and judgment according to claim 1, wherein the values of c and d in the similarity discrimination model in the step 8) are determined by the Sigmoid function, wherein the value of c is 0.9 and the value of d is 0.1.