CN112836486A - Group implicit stance analysis method based on word vectors and Bert - Google Patents

Group implicit stance analysis method based on word vectors and Bert

Info

Publication number
CN112836486A
CN112836486A (application CN202011451101.2A)
Authority
CN
China
Prior art keywords
implicit
words
sentence
bert
attitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011451101.2A
Other languages
Chinese (zh)
Other versions
CN112836486B (en)
Inventor
韩旭
王博
蒋沁学
陈根华
黄博帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011451101.2A priority Critical patent/CN112836486B/en
Publication of CN112836486A publication Critical patent/CN112836486A/en
Application granted granted Critical
Publication of CN112836486B publication Critical patent/CN112836486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a group implicit stance analysis method based on word vectors and Bert, in which a Bert model is trained on a text corpus and group implicit stance analysis is performed using sentence vectors. The system comprises a data analysis module, a model training module and an implicit stance analysis module. The data analysis module parses and extracts the posts published by users in a social group and classifies the users' posts according to the target words and attribute words used in the implicit association test; the extracted text is split into sentences to obtain a sentence set A, from which a sentence set B containing both target words and attribute words and a set C of sentences that do not contain both are extracted. The model training module builds a model that learns the textual biases in the social group's language big data. The implicit stance analysis module measures the relationship between the corresponding target words and attribute words according to the distance between the sentence embedding vectors, thereby quantifying the implicit stance of the social group's users.

Description

Group implicit stance analysis method based on word vectors and Bert
Technical Field
The invention belongs to the field of group language data analysis in social computing and social psychology, relates to a method for analyzing the implicit stance and attitude within a group, and in particular to a group implicit stance analysis method based on word vectors and Bert.
Background
In social media language big data (such as microblogs, Twitter, news and Wikipedia), the language users publish can implicitly reflect their attitudes toward things. A stance is the evaluative disposition of an individual or group toward a concept or object. Research on the attitudes and stances of individuals or groups is currently concentrated in social psychology. Attitudes are divided into explicit attitudes and implicit attitudes: explicit attitudes are conscious, controllable and easy to report, whereas implicit attitudes are uncontrollable and cannot be consciously accessed [1]. Meanwhile, language has also been used to mine attitudes by means of natural language processing techniques [2-4]; the attitudes of individuals or groups toward certain events, objects, people or concepts can be mined by analyzing the sentiment and semantics of individual or group utterances [5].
In psychological studies, the implicit association test requires the active cooperation of the test subjects [7] and can therefore only be administered to small populations. Meanwhile, although models that learn purely from textual representations have made considerable progress, human readers find it difficult to judge, without rich background knowledge, the sincerity of sentences whose explicit expression does not match the author's true attitude; a model that has learned human biases can do so to some extent.
An implicit attitude is an internal attitude that affects an individual's behavior in an unconscious manner. The implicit association test [9] is one of the major psychological instruments for measuring implicit attitudes, designed to reliably assess individual attitude differences in a manner that produces large effect sizes [10]. Greenwald and Banaji argued that research on implicit and explicit memory can be applied to the study of individual or group attitudes [11]. The implicit association test, proposed by Greenwald et al. in 1998, measures the implicit stance of a subject by measuring the association between concept words and attribute words. If memories unavailable to consciousness can influence an individual's actions, then such associations can also influence the individual's attitudes and behavior. Exploiting differences in individuals' conceptual associations helps psychological researchers understand attitudes that cannot be measured by self-report due to lack of awareness and social desirability bias [12].
Currently, attitude measurement based on the analysis of text [13-19] relies mainly on the explicit expression of attitudes in text, and implicit attitudes have not been studied in depth. Text sentiment analysis is the main method for measuring attitudes. Sentiment analysis uses natural language processing, text analysis and computational linguistics to identify, extract, quantify and study affective states and subjective information [6]; it aims to determine the author's attitude toward a certain topic, or the attitude polarity toward a document, object or event. The general or aspect-based attitudes [20] expressed in online comments can be understood through sentiment analysis, where the attitude may be an affective state. Sentiment analysis generally classifies the opinions in text into "positive", "neutral" and "negative" categories [21]. Generally speaking, attitude research based on sentiment analysis involves several key factors, such as objects, attributes, attitude polarity and attitude holders. Meanwhile, resources and techniques from natural language processing are also applied to attitude measurement, such as external dictionaries [18] and syntactic analysis [19].
In 2017, Caliskan et al. proposed the word embedding association test [22], which measures attitudes by linking the association strengths of the implicit association test to the semantic distances between words. Word embedding is a semantic representation of words that depends on their contexts in a corpus; words that are closer in the vector space should be semantically closer.
In 2019, May et al. extended the word embedding association test to measure attitudes at the sentence level, from the standpoint of sentence encoders, and found that sentence-level tests are more likely to reveal significant associations than word-level tests, although word-level tests are more effective. However, while the word embedding association test introduced classical psychometric measures into automated language analysis, it does not distinguish between explicit and implicit attitudes, so its results reflect the combined effect of the explicit and the implicit stance, which is confusing in cases where the two differ.
However, first, the measurement of the explicit and implicit attitudes of individuals or groups is mainly concentrated in social psychology, and psychology-based measurement methods require the active cooperation of subjects; only a small number of subjects can take part in the experiments, so these methods cannot be applied to large-scale studies of individual or group attitudes, nor can they analyze subjects' historical attitudes. Second, although text-based attitude measurement methods are not limited by the number or availability of subjects and can be applied to large populations, they rely mainly on explicit expression in the text and do not study implicit attitudes in depth. Meanwhile, the latest word-embedding-based attitude measurement methods do not distinguish between explicit and implicit attitudes. Explicit and implicit attitudes play different roles in social life: explicit attitudes can be communicated in public and form mainstream values, while implicit attitudes can determine an individual's behavior without awareness. It is therefore necessary to clearly distinguish explicit from implicit attitudes when measuring attitudes. Third, word-based methods such as WEAT or verb extraction are simpler: they consider only single words, regardless of grammar and context. Judgments at the sentence level carry deeper meaning [23] and allow cosine similarities to be computed for different sentences, such as the similarity between a question and a corresponding answer; the more appropriate a particular answer is for a given question, the higher its cosine similarity. It is therefore necessary to extend WEAT to SEAT for stance analysis on big data.
[References]
[1] Timothy D. Wilson, Samuel Lindsey, Tonya Y. Schooler. 2000. A model of dual attitudes. Psychological Review, 107(1):101-126.
[2] Sap, M., Prasetio, M.C., Holtzman, A., Rashkin, H., Choi, Y. 2017. Connotation Frames of Power and Agency in Modern Films. In Proc. EMNLP 2017.
[3] McKenzie, R.M. and E. Carrie. 2018. Implicit-explicit attitudinal discrepancy and the investigation of language attitude change in progress. Journal of Multilingual and Multicultural Development, 0(0):1-15.
[4] Carpenter, Jordan, Daniel Preoţiuc-Pietro, Lucie Flekova, Salvatore Giorgi, Courtney Hagan, Margaret Kern, Anneke Buffone, Lyle Ungar, Martin Seligman. 2016. Real Men don't say 'cute': Using Automatic Language Analysis to Isolate Inaccurate Aspects of Stereotypes. Social Psychological and Personality Science.
[5] Liu, B. 2010. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, 2, 627-666.
[6] Northrup, D.A. 1996. The Problem of the Self-Report in Survey Research. Institute for Social Research, 11(3).
[7] Stone, A.A., Turkkan, J.S., Bachrach, C.A., Jobe, J.B., Kurtzman, H.S., Cain, V.S. 2003. The science of self-report: Implications for research and practice. Experimental Psychology, 50(3):231-232.
[8] Paulhus, D.L., Vazire, S. 2007. The self-report method. Handbook of Research Methods in Personality Psychology, 1, 224-239.
[9] Greenwald, A.G., McGhee, D.E., Schwartz, J.L. 1998. Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6), 1464-1480.
[10] Lane, K.A., Banaji, M.R., Nosek, B.A., Greenwald, A.G. 2007. Understanding and Using the Implicit Association Test: IV: What We Know (So Far) about the Method. In B. Wittenbrink & N. Schwarz (Eds.), Implicit Measures of Attitudes, 59-102. New York, NY, US: Guilford Press.
[11] Greenwald, A.G., Banaji, M.R. 1995. Implicit social cognition: attitudes, self-esteem, and stereotypes. Psychological Review, 102(1), 4.
[12] Nosek, B.A., Greenwald, A.G., Banaji, M.R. 2005. Understanding and using the Implicit Association Test: II. Method variables and construct validity. Personality and Social Psychology Bulletin, 31(2), 166-180.
[13] Cambria, E., Poria, S., Gelbukh, A., & Thelwall, M. 2017. Sentiment analysis is a big suitcase. IEEE Intelligent Systems, 32(6), 74-80.
[14] Mohammad Tubishat, Norisma Idris, Mohammad A.M. Abushariah. 2018. Implicit aspect extraction in sentiment analysis. Information Processing and Management: an International Journal, 54(4), 545-563.
[15] Khanna, B., Moses, S., & Nirmala, M. 2018. SoftMax based User Attitude Detection Algorithm for Sentimental Analysis. Procedia Computer Science, 125, 313-320.
[16] Chaturvedi, I., Cambria, E., Welsch, R.E., & Herrera, F. 2018. Distinguishing between facts and opinions for sentiment analysis: Survey and challenges. Information Fusion, 44, 65-77.
[17] Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. 2015. It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. In ICWSM (pp. 454-463).
[18] Hube, C. 2017. Bias in Wikipedia. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 717-721).
[19] Christoph Hube and Besnik Fetahu. 2018. Detecting Biased Statements in Wikipedia. In Companion Proceedings of The Web Conference 2018 (WWW '18), 1779-1786.
[20] Poria, S., Cambria, E., Gelbukh, A. 2016. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108, 42-49.
[21] Mozetič, I., Grčar, M., Smailović, J. 2016. Multilingual Twitter sentiment classification: The role of human annotators. PLoS ONE, 11(5), e0155036.
[22] Caliskan, A., Bryson, J.J., Narayanan, A. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
[23] Jentzsch, S., Schramowski, P., Rothkopf, C.A., and Kersting, K. 2019. Semantics derived automatically from language corpora contain human-like moral choices. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a group implicit stance analysis method based on word vectors and Bert.
The purpose of the invention is realized by the following technical scheme:
a hidden-in-group-based method for analyzing the hidden-in-group ground based on word vectors and Bert trains a Bert model through a text corpus and performs hidden-in-group ground analysis by combining sentence vectors; the system comprises a data analysis module, a model training module and an implicit vertical analysis module;
the data analysis module is used for analyzing and extracting the speech data published by the users in the social group and classifying the speech of the users according to the target words and the attribute words mentioned in the implicit association test; sentence segmentation is carried out on the extracted text to obtain a statement set A, and then a sentence set B containing both target words and attribute words and a set C not containing the target words and the attribute words are extracted;
the model training module is used for constructing a model for learning the language big data text prejudice of the social group; acquiring an embedded vector of each sentence based on a set obtained by the data analysis module according to the target words and the attribute words;
the implicit standpoint analysis module measures the relationship between the corresponding target words and the attribute words according to the distance between the embedded vectors of the sentences, so that the implicit standpoint attitude of the social group users is quantified.
Compared with the prior art, the technical solution of the invention has the following beneficial effects: the method can quantify how an AI model that has inherited human biases influences the encoding of real sentences, and, because the corpus-level training makes the method model-independent, it can accurately quantify attitudes from both the explicit and the implicit perspective.
The invention distinguishes macroscopic and microscopic analysis on a sentence-level basis. By adopting a corpus-level, model-independent method, the influence of an AI model that has inherited human biases on the encoding of real sentences can be studied more fully at both the macroscopic and the microscopic level. The invention also performs static and dynamic analysis of the group's implicit attitudes. The implicit stance analysis method further provides a way to analyze the influence of implicit bias on the sincerity of explicit expression.
Drawings
Fig. 1 and 2 are schematic diagrams of the overall framework of the method of the invention.
Figs. 3a to 3d show the dynamic evolution of implicit attitudes within the population on Wikipedia and Twitter.
Fig. 4 shows the dynamic evolution of the population's implicit attitude bias regarding occupational gender.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The group implicit stance analysis method based on word vectors and Bert comprises three modules: a data analysis module, a model training module and an implicit stance analysis module.
1. Data analysis module
The data analysis module parses and extracts the posts published by users in the social group; in this embodiment, the posts of Wikipedia and Twitter users are classified according to the target words and attribute words used in the implicit association test. The extracted text is split into sentences to obtain a sentence set A, from which a sentence set B containing both target words and attribute words and a set C of the remaining sentences are extracted. Here WikiExtractor is used to extract text, the Python toolkit spaCy to perform tokenization and sentence segmentation, and the Stanford parser to analyze sentence structure.
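As a concrete illustration of this step, the sketch below splits raw text into the sentence sets A, B and C. It assumes spaCy with the small English model for sentence segmentation; the target and attribute word lists shown are illustrative placeholders, not the actual implicit association test vocabularies, and C is taken here as the sentences of A that do not contain both a target and an attribute word.

    # Minimal sketch of the data analysis step (assumed word lists, plain-text input).
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    TARGET_WORDS = {"programmer", "engineer", "scientist", "nurse", "teacher", "librarian"}
    ATTRIBUTE_WORDS = {"man", "woman", "male", "female", "he", "she"}

    def build_sentence_sets(raw_text):
        """Split raw text into sentence set A, then extract
        B = sentences containing both a target and an attribute word,
        C = the remaining sentences."""
        doc = nlp(raw_text)
        set_a = [sent.text.strip() for sent in doc.sents]
        set_b, set_c = [], []
        for sentence in set_a:
            tokens = {tok.lower().strip(".,!?") for tok in sentence.split()}
            if tokens & TARGET_WORDS and tokens & ATTRIBUTE_WORDS:
                set_b.append(sentence)
            else:
                set_c.append(sentence)
        return set_a, set_b, set_c

    if __name__ == "__main__":
        a, b, c = build_sentence_sets("The engineer said she loved her job. The rose is red.")
        print(len(a), len(b), len(c))  # 2 1 1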
In word embedding association tests, GloVe and word2vec embeddings reveal the implicit stances of social groups. In the invention, the word-level implicit association test is extended to a sentence-level implicit association test. To measure the implicit stance of a sentence encoder toward a social group, the method creates a test whose target concepts are African American and European American names and whose attribute words are terms used to describe African Americans and European Americans; sentence versions of the attribute and target concept words are generated by inserting them into sentence templates. Specifically, each word in the target word set is placed into several semantically bleached sentence templates, such as { "This is a rose", "A rose is here", "This will be a rose", "The roses are here", ... }. Each word in the attribute word set is likewise placed into several semantically bleached sentence templates, such as { "There is love", "That is happy", "This is a friend", "They are evil", ... }.
2. Model training module
The model training module adopts the multi-layer bidirectional Transformer encoder Bert, where L denotes the number of layers, H the hidden size, and A the number of self-attention heads. In all cases the feed-forward size (the output dimension of the fully connected layer) is set to 4H, i.e., 3072 when H = 768. The model architecture used is:
BERT-base: L = 12, H = 768, A = 12, total parameters = 110M
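A minimal sketch of how sentences can be encoded with a BERT-base model of this configuration is shown below. It assumes the HuggingFace transformers library and the public bert-base-uncased checkpoint; the method itself would instead load the models Bert_A and Bert_A-B trained on the corpus sets described above. Mean pooling over the last hidden states is one common choice of sentence embedding.

    # Sketch: encoding sentences with a BERT-base encoder (assumed checkpoint name).
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    def embed_sentences(sentences):
        """Return one 768-dimensional vector per sentence (mean of last hidden states)."""
        with torch.no_grad():
            batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
            hidden = model(**batch).last_hidden_state            # (batch, seq_len, 768)
            mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
            return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling

    vectors = embed_sentences(["This is a rose.", "There is love."])
    print(vectors.shape)  # torch.Size([2, 768])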
3. Implicit stance analysis module
Using the sentence sets obtained from the data analysis and the trained models, the method measures the group's explicit attitude and implicit attitude toward the target words separately.
As shown in Fig. 1 and Fig. 2, for the language big data from Wikipedia and Twitter, a sentence set A is obtained through data analysis, and a sentence set B containing both target words and attribute words is extracted. The invention first classifies sentences according to the concept words and attribute words they contain. Given a pair of concepts Ci and Cj and a pair of attributes Dp and Dq, let Cwi and Cwj be the concept word sets of Ci and Cj, and Dwp and Dwq the attribute word sets of Dp and Dq; the sentences containing a concept word of Ci or Cj together with an attribute word of Dp or Dq form the set B. For example, consider two sets of target words (e.g., programmer, engineer, scientist vs. nurse, teacher, librarian) and two sets of attribute words (e.g., male vs. female terms). The null hypothesis is that there is no difference in the relative similarity between the two sets of target words and the two sets of attribute words. Formally, let X and Y be two equal-size sets of target words, let A and B be the two sets of attribute words, and let cos(a, b) denote the cosine of the angle between vectors a and b.
Specifically, following the word embedding association test [22], the attitude bias is calculated by the following formulas:
s(w, A, B) = mean_{a in A} cos(w, a) - mean_{b in B} cos(w, b)    (1)
where s(w, A, B) represents the degree of association between the target word w and the attribute word sets A and B.
effect.size = [ mean_{x in X} s(x, A, B) - mean_{y in Y} s(y, A, B) ] / std-dev_{w in X∪Y} s(w, A, B)    (2)
where effect.size represents the strength of the population's association between the concepts Ci or Cj and the attributes Dp or Dq; this effect size is the attitude bias.
To ensure statistical rigor, a significance test is also set up for the attitude bias calculation. With the test statistic s(X, Y, A, B) = sum_{x in X} s(x, A, B) - sum_{y in Y} s(y, A, B), the one-sided p-value of the permutation test over equal-size partitions (Xi, Yi) of X ∪ Y is:
p = Pr_i[ s(Xi, Yi, A, B) > s(X, Y, A, B) ]    (3)
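The sketch below implements the association score of Eq. (1), the effect size of Eq. (2) and the permutation test of Eq. (3), following the standard word embedding association test procedure that the method builds on. Inputs are assumed to be numpy vectors (word or sentence embeddings); for large sets, the exhaustive enumeration of partitions would in practice be replaced by random sampling.

    # Sketch: WEAT/SEAT-style association score, effect size and permutation test.
    import numpy as np
    from itertools import combinations

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def s_wAB(w, A, B):
        # Eq. (1): mean cosine with attribute set A minus mean cosine with set B
        return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

    def s_XYAB(X, Y, A, B):
        # test statistic s(X, Y, A, B)
        return sum(s_wAB(x, A, B) for x in X) - sum(s_wAB(y, A, B) for y in Y)

    def effect_size(X, Y, A, B):
        # Eq. (2): standardized difference of mean associations
        assoc_all = [s_wAB(w, A, B) for w in X + Y]
        return (np.mean([s_wAB(x, A, B) for x in X]) -
                np.mean([s_wAB(y, A, B) for y in Y])) / np.std(assoc_all, ddof=1)

    def p_value(X, Y, A, B):
        # Eq. (3): one-sided permutation test over equal-size partitions of X ∪ Y
        observed = s_XYAB(X, Y, A, B)
        pooled = X + Y
        count = total = 0
        for idx in combinations(range(len(pooled)), len(X)):
            chosen = set(idx)
            Xi = [pooled[i] for i in chosen]
            Yi = [pooled[i] for i in range(len(pooled)) if i not in chosen]
            count += s_XYAB(Xi, Yi, A, B) > observed
            total += 1
        return count / total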
For model training, the Bert model trained on set A is denoted Bert_A, and the Bert model trained on set A-B is denoted Bert_A-B. Bert_A and Bert_A-B are then used respectively to encode the sentence templates from step 1.
For the implicit attitude calculation, sentences of the form "target word + function words + attribute word", such as "The Black man is good/bad" and "The man is good/bad at math", are constructed manually and are called positive/negative artificial combination sentences. Sentences in the real corpus that contain both a target word and a positive/negative attribute word are called positive/negative real combination sentences, and manually constructed sentences of the form "attribute word + function words" are called positive/negative artificial attribute sentences. The encodings are denoted as follows: Bert1 and Bert2 are the Bert_A-B encodings of the positive and negative artificial combination sentences, and Bert1.2 their union; Bert3 and Bert4 are the Bert_A-B encodings of the positive and negative real combination sentences, and Bert3.4 their union; Bert5 and Bert6 are the Bert_A encodings of the positive and negative artificial combination sentences, and Bert5.6 their union; Bert7 and Bert8 are the Bert_A encodings of the positive and negative real combination sentences, and Bert7.8 their union; Bert9 and Bert10 are the Bert_A-B encodings of the positive and negative artificial attribute sentences; Bert11 and Bert12 are the Bert_A encodings of the positive and negative artificial attribute sentences.
To show that the model with implicit bias only (Bert_A-B) and the model with mixed explicit and implicit bias (Bert_A) differ significantly in the attitude toward the attribute words when encoding the manually constructed attitude-expressing sentences, this embodiment designs the following experiment:
ImplicitBias.Size1 = S(Bert1.2, Bert9, Bert10) - S(Bert5.6, Bert11, Bert12)    (4)
To show that the two models also differ significantly in the attitude toward the attribute words when encoding the real attitude-expressing sentences, this embodiment designs the following experiment:
ImplicitBias.Size2 = S(Bert3.4, Bert9, Bert10) - S(Bert7.8, Bert11, Bert12)    (5)
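The patent does not spell out the operator S in equations (4) and (5); in the sketch below, S(T, A, B) is assumed to be the mean association of the sentence encodings in T with the positive attribute-sentence encodings (Bert9 or Bert11) versus the negative ones (Bert10 or Bert12). This is one plausible reading, not a definitive implementation.

    # Sketch of Eqs. (4)-(5) under an assumed definition of S(T, A, B).
    import numpy as np

    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def _s(t, A, B):
        # association of one sentence encoding t with attribute encodings A vs. B
        return np.mean([_cos(t, a) for a in A]) - np.mean([_cos(t, b) for b in B])

    def S(T, A, B):
        # assumed: mean association of the sentence encodings in T with A vs. B
        return float(np.mean([_s(t, A, B) for t in T]))

    def implicit_bias_size1(bert1_2, bert9, bert10, bert5_6, bert11, bert12):
        # Eq. (4): artificial combination sentences, Bert_A-B minus Bert_A
        return S(bert1_2, bert9, bert10) - S(bert5_6, bert11, bert12)

    def implicit_bias_size2(bert3_4, bert9, bert10, bert7_8, bert11, bert12):
        # Eq. (5): real combination sentences, Bert_A-B minus Bert_A
        return S(bert3_4, bert9, bert10) - S(bert7_8, bert11, bert12)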
Specifically, the experimental material of this embodiment comprises 2 related data sets that are widely used in related research: a Twitter data set and a Wikipedia data set. The invention analyzes the group's implicit stance and its evolution.
1. Macroscopic computation of the implicit attitude within the group
In the invention, the word level implicit association test is extended to the sentence level implicit association test.
Specifically, each word in the target word set is placed into several semantically bleached sentence templates, such as { "This is a rose", "A rose is here", "This will be a rose", "The roses are here", ... }, and each word in the attribute word set is placed into templates such as { "There is love", "That is happy", "This is a friend", "They are evil", ... }. The sentences are encoded by the trained Bert models, and the relations between the sentence vectors are computed.
TABLE 1
Table 1 shows the differences between the sentence-level implicit association test based on the Wikipedia and Twitter corpora and the word-embedding-level implicit association test. d denotes the effect size and p the hypothesis test value. The first column lists the target word pairs: flowers vs. insects, musical instruments vs. weapons, European American vs. African American names, male vs. female names, mathematics vs. arts, science vs. arts, mental vs. physical illness, and young vs. old people's names. The second column lists the attribute word pairs: pleasant vs. unpleasant, career vs. family, male vs. female terms, temporary vs. permanent, and pleasant vs. unpleasant.
2. Microscopic computation of the implicit attitude within the group
Although the macroscopic calculation of the implicit attitude within the group shows that people's biases toward the target words are reflected in sentence-level encodings, the influence of an AI model that has inherited people's explicit/implicit biases on the encoding of real sentences has not been studied, especially the encoding of sentences that express an attitude toward a specific object (i.e., "target word + attribute word" sentences). Whether the latter encodings are biased indicates how well the AI model understands attitude-expressing sentences, much as a human reader would. For sentences whose explicit expression does not reflect the author's true attitude, it is difficult for human readers to judge their sincerity without rich background knowledge; an AI model that has learned human biases can do so to some extent. To investigate this idea, Experiment 2 was conducted.
TABLE 2
Table 2 shows the differences between the Bert models' encodings of the artificial combination sentences and of the real corpus sentences. DA denotes the model's encoding bias between the artificial combination sentences and the artificial attribute sentences, and PA its hypothesis test value; DB denotes the encoding bias between the real combination sentences and the artificial attribute sentences, and PB its hypothesis test value. The first and second columns list the same target word pairs and attribute word pairs as in Table 1.
3. Dynamic analysis of the implicit stance within the group
In this embodiment, in order to study the evolution of implicit attitudes within the social group, the corpus data is divided by month, covering 36 months in total, and the implicit attitude is computed separately for each month, as shown in Figs. 3a to 3d.
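A small sketch of this per-month computation follows; monthly_corpora and measure_bias are illustrative names, with measure_bias standing for the full per-corpus pipeline above (sentence extraction, encoding, effect size).

    # Sketch: recomputing the stance measure for each monthly slice of the corpus.
    def implicit_attitude_over_time(monthly_corpora, measure_bias):
        """monthly_corpora: {"YYYY-MM": raw_text}; returns {month: effect size}
        for plotting the dynamic evolution (cf. Figs. 3a-3d)."""
        return {month: measure_bias(text)
                for month, text in sorted(monthly_corpora.items())}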
The above analysis shows that the difference between the implicit and the explicit attitude bias is more significant for social concepts (African American vs. European American) than for non-social concepts (flowers vs. insects, instruments vs. weapons). The implicit and explicit attitude biases differ significantly, and these differences are consistent with classical psychological experiments: flowers are viewed more positively than insects, musical instruments more positively than weapons, and European American names are more closely associated with positive attributes than African American names. The implicit stance analysis also shows that some people explicitly express one attitude toward an attribute of the target while their real attitude is the opposite. In the dynamic changes of the implicit attitude over the last 3 years, the bias is always present and fluctuates, but is stable overall. Fig. 4 further shows that society's acceptance of women is increasing and that women's acceptance in the professions is rising. The invention distinguishes macroscopic and microscopic analysis on a sentence-level basis; by adopting a corpus-level, model-independent method, the influence of an AI model that has inherited human biases on the encoding of real sentences can be studied more fully at both levels. The invention also performs static and dynamic analysis of the group's implicit attitudes, and the implicit stance analysis method provides a way to analyze the influence of implicit bias on the sincerity of explicit expression.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solution of the invention; the specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art, having the benefit of this disclosure, may make numerous modifications and changes without departing from the scope of the invention as defined by the claims.

Claims (1)

1. A group implicit stance analysis method based on word vectors and Bert, characterized in that a Bert model is trained on a text corpus and group implicit stance analysis is performed using sentence vectors; the system comprises a data analysis module, a model training module and an implicit stance analysis module;
the data analysis module parses and extracts the posts published by users in the social group and classifies the users' posts according to the target words and attribute words used in the implicit association test; the extracted text is split into sentences to obtain a sentence set A, from which a sentence set B containing both target words and attribute words and a set C of sentences that do not contain both are extracted;
the model training module builds a model that learns the textual biases in the social group's language big data, and obtains an embedding vector for each sentence based on the sets produced by the data analysis module according to the target words and attribute words;
the implicit stance analysis module measures the relationship between the corresponding target words and attribute words according to the distance between the sentence embedding vectors, thereby quantifying the implicit stance of the social group's users.
CN202011451101.2A 2020-12-09 2020-12-09 Group implicit stance analysis method based on word vectors and Bert Active CN112836486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011451101.2A CN112836486B (en) 2020-12-09 Group implicit stance analysis method based on word vectors and Bert

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011451101.2A CN112836486B (en) 2020-12-09 Group implicit stance analysis method based on word vectors and Bert

Publications (2)

Publication Number Publication Date
CN112836486A true CN112836486A (en) 2021-05-25
CN112836486B CN112836486B (en) 2022-06-03

Family

ID=75923517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011451101.2A Active CN112836486B (en) 2020-12-09 Group implicit stance analysis method based on word vectors and Bert

Country Status (1)

Country Link
CN (1) CN112836486B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781391A (en) * 2019-10-22 2020-02-11 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN110852062A (en) * 2019-10-17 2020-02-28 天津大学 Method for automatically measuring group external attitude and internal attitude by using speech information
US20200175119A1 (en) * 2018-12-04 2020-06-04 Electronics And Telecommunications Research Institute Sentence embedding method and apparatus based on subword embedding and skip-thoughts
CN111753044A (en) * 2020-06-29 2020-10-09 浙江工业大学 Regularization-based language model for removing social bias and application
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175119A1 (en) * 2018-12-04 2020-06-04 Electronics And Telecommunications Research Institute Sentence embedding method and apparatus based on subword embedding and skip-thoughts
CN110852062A (en) * 2019-10-17 2020-02-28 天津大学 Method for automatically measuring group external attitude and internal attitude by using speech information
CN110781391A (en) * 2019-10-22 2020-02-11 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111753044A (en) * 2020-06-29 2020-10-09 浙江工业大学 Regularization-based language model for removing social bias and application
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴胜涛 (Wu Shengtao) et al.: "The other-salience effect of the justice motive: Evidence from a word-embedding association test", Chinese Science Bulletin (《科学通报》) *

Also Published As

Publication number Publication date
CN112836486B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
Ramat Grammaticalization processes in the area of temporal and modal relations
Adger et al. Variation in English syntax: theoretical implications
CN110717018A (en) Industrial equipment fault maintenance question-answering system based on knowledge graph
El Gohary et al. A computational approach for analyzing and detecting emotions in Arabic text
Zad et al. A survey of deep learning methods on semantic similarity and sentence modeling
Madabushi et al. CxGBERT: BERT meets construction grammar
CN108874896A (en) A kind of humorous recognition methods based on neural network and humorous feature
CN112527968A (en) Composition review method and system based on neural network
CN112417161A (en) Method and storage device for recognizing upper and lower relationships of knowledge graph based on mode expansion and BERT classification
Gurevych et al. Semantic coherence scoring using an ontology
Alès et al. A methodology to design human-like embodied conversational agents
Leão et al. Extending WordNet with UFO foundational ontology
CN112836486B (en) Group implicit stance analysis method based on word vectors and Bert
Bayat The Impact of Ellipses on Reading Comprehension.
Behzadi Natural language processing and machine learning: A review
Deshors A multifactorial approach to linguistic structure in L2 spoken and written registers
CN115080690A (en) NLP-based automatic correction method and system for test paper text
Kehler Coherence establishment as a source of explanation in linguistic theory
Yu et al. Extraction of implicit quantity relations for arithmetic word problems in chinese
Yang et al. Detecting senior executives’ personalities for predicting corporate behaviors: an attention-based deep learning approach
Lee Natural Language Processing: A Textbook with Python Implementation
Sun The application of improving machine learning algorithm and voice technology in the teaching evaluation of ideological and political education
Verbeke et al. Differential subject marking in Nepali imperfective constructions: A probabilistic grammar approach
Almog Would you believe that?
Peng et al. Readability assessment for Chinese L2 sentences: an extended knowledge base and comprehensive evaluation model-based method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant