CN111737466A - Method for quantizing interactive information of deep neural network - Google Patents

Method for quantizing interactive information of deep neural network

Info

Publication number
CN111737466A
CN111737466A
Authority
CN
China
Prior art keywords
unit
units
neural network
deep neural
sample
Prior art date
Legal status
Granted
Application number
CN202010558767.1A
Other languages
Chinese (zh)
Other versions
CN111737466B (en)
Inventor
李超
徐勇军
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010558767.1A priority Critical patent/CN111737466B/en
Publication of CN111737466A publication Critical patent/CN111737466A/en
Application granted granted Critical
Publication of CN111737466B publication Critical patent/CN111737466B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for quantifying the interaction information of a deep neural network, comprising the following steps: S1, obtaining a sample from a natural language processing dataset, the sample comprising a plurality of units, each unit corresponding to a word, and subjecting the units in the sample to repeated aggregation processing until they are aggregated into a single unit; S2, constructing, from the way units were aggregated during the repeated aggregation of the given sample in step S1, a tree diagram reflecting the inter-word interaction information modeled inside the deep neural network. The method objectively quantifies the interaction information among input-sample words modeled in the deep neural network and clusters adjacent units with significant interaction according to the interaction gain ratio, finally obtaining a tree-shaped hierarchical structure reflecting the inter-word interaction information modeled in the deep neural network, thereby providing a general method for further understanding deep neural networks.

Description

Method for quantizing interactive information of deep neural network
Technical Field
The invention relates to the technical field of deep learning, in particular to the application of deep neural networks in the field of natural language processing, and more particularly to a method for quantifying the interaction information of a deep neural network.
Background
At present, deep neural networks show excellent modeling capability on various natural language processing tasks, but a deep neural network is generally regarded as a black-box model whose internal modeling logic is invisible. This property makes it difficult to effectively evaluate the accuracy and reliability of its final decisions, so interpreting the internal modeling logic of neural networks has become an important research direction. In the field of natural language processing in particular, which interactions among input words the network models is still opaque, so decoupling and quantifying the interaction information among the words of an input sentence modeled by a deep neural network plays an important role in understanding the internal logic and decision-making mechanism of the network.
Disclosure of Invention
Therefore, the present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a new method for quantifying deep neural network interaction information in order to understand the logic inherent in the deep neural network.
The invention discloses a method for quantifying deep neural network interaction information, used for constructing a tree diagram that quantifies the interaction information among words modeled by a deep neural network in a natural language processing task. The method comprises the following steps:
s1, obtaining a sample from the natural language processing field data set, wherein the sample comprises a plurality of units, each unit corresponds to a word, and the units in the sample are subjected to multiple aggregation processing until the units in the sample are aggregated into a unit; wherein each polymerization treatment comprises: inputting a current sample into a deep neural network, and calculating a Shapril value of each unit in the current sample according to the output of the deep neural network, wherein the deep neural network is used for a natural language processing task in the field of natural language processing; calculating the interactive gain rate between every two adjacent units based on the sand-pril value of each unit, aggregating the two adjacent units with the maximum interactive gain rate into a new unit, and forming a new current sample with other units in the current sample for the next aggregation treatment;
s2, constructing a tree diagram reflecting the inter-word interaction information of the deep neural network internal modeling according to the unit aggregation mode in the multiple aggregation processing process of the given sample in the step S1. Preferably, the binary tree is constructed as follows: s31, forming a first layer leaf node of the binary tree from bottom to top by using all units in the sample; and S32, according to the aggregation sequence, taking a new unit formed after each aggregation as a father node of two adjacent units before the aggregation until a root node of the tree is formed.
Wherein the Shapley value of each unit in the current sample is the weighted average of the marginal contributions of that unit to the sets that can be formed from all other units in the current sample. The Shapley value of each unit in the current sample is determined as follows:
$$\phi_v(a_i) = \sum_{S \subseteq N \setminus \{a_i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{a_i\}) - v(S)\bigr)$$

wherein $v$ denotes the neural network, $\phi_v(a_i)$ denotes the Shapley value of the $i$-th unit $a_i$ in the current sample, $N$ denotes the set of all units in the current sample, $|N|$ the size of the set $N$, $S$ any set that can be formed from units other than the $i$-th unit $a_i$, $|S|$ the size of the set $S$, $!$ the factorial, and $v(\cdot)$ the output of the deep neural network; $v(S \cup \{a_i\}) - v(S)$ denotes the marginal contribution of the $i$-th unit $a_i$ to the set $S$, where $v(S \cup \{a_i\})$ denotes the output obtained by adding the $i$-th unit $a_i$ to the set $S$ and feeding the result to the neural network, and $v(S)$ denotes the output obtained by feeding the set $S$ to the neural network.
The interaction gain ratio between two adjacent units is the ratio of the interaction gain of the two adjacent units to all the interaction information involving the two units.
Preferably, the interaction gain ratio between two adjacent units is determined by:
$$r([S_1],[S_2]) = \frac{\left|B_{between}([S_1],[S_2])\right|}{\left|B_{between}([S_1]',[S_1])\right| + \left|B_{between}([S_1],[S_2])\right| + \left|B_{between}([S_2],[S_2]')\right| + \left|\phi([S_1])\right| + \left|\phi([S_2])\right|}$$

wherein $[S_1]$ denotes the unit formed by aggregating all units in the set $S_1$, $[S_2]$ denotes the unit formed by aggregating all units in the set $S_2$, $[S_1]$ and $[S_2]$ being two adjacent units; $B_{between}([S_1],[S_2])$ is the interaction gain between the two adjacent units $[S_1]$ and $[S_2]$; $[S_1]'$ is the unit adjacent to $[S_1]$ on its left before the aggregation, and $[S_2]'$ is the unit adjacent to $[S_2]$ on its right before the aggregation; $B_{between}([S_1]',[S_1])$ is the interaction gain between units $[S_1]'$ and $[S_1]$, and $B_{between}([S_2],[S_2]')$ is the interaction gain between units $[S_2]$ and $[S_2]'$; the total interaction information related to the units $[S_1]$ and $[S_2]$ consists of $B_{between}([S_1],[S_2])$, $B_{between}([S_1]',[S_1])$, $B_{between}([S_2],[S_2]')$, $\phi([S_1])$ and $\phi([S_2])$, where $\phi([S_1])$ and $\phi([S_2])$ are the Shapley values of units $[S_1]$ and $[S_2]$ respectively.
And the interaction gain between two adjacent units is the difference between the interaction gain within the new unit formed by aggregating the two adjacent units and the interaction gains within the two adjacent units before aggregation.
Preferably, the interaction gain between two adjacent units is determined by:
$$B_{between}([S_1],[S_2]) = B([S]) - B([S_1]) - B([S_2])$$

wherein $[S]$ denotes the unit formed by aggregating all units in the set $S$ (with $S = S_1 \cup S_2$), $[S_1]$ denotes the unit formed by aggregating all units in the set $S_1$, $[S_2]$ denotes the unit formed by aggregating all units in the set $S_2$, $B(\cdot)$ denotes the interaction gain within a unit, and $B_{between}(\cdot)$ denotes the interaction gain between units.
In some embodiments of the invention, the interaction gain within each unit is determined by:
$$B([S]) = \phi_{v_{(N\setminus S)\cup\{[S]\}}}([S]) - \sum_{b\in S}\phi_{v_{(N\setminus S)\cup\{b\}}}(b)$$

wherein $[S]$ denotes the unit formed by aggregating all units in the set $S$, $b$ is a unit in the set $S$, and $N \setminus S$ denotes the set formed by the units in the set $N$ excluding those in the set $S$.
Compared with the prior art, the invention has the following advantages: it innovatively provides a method for quantitatively evaluating and understanding the internal logic of a deep neural network. Drawing on ideas from game theory, it objectively quantifies the interaction information among input-sample words modeled in the deep neural network, proposes a dedicated index, the interaction gain ratio, to evaluate that interaction, and constructs a tree structure accordingly: adjacent units with significant interaction are clustered according to the magnitude of the interaction gain ratio, finally yielding a tree-shaped hierarchical structure that reflects the inter-word interaction information modeled in the deep neural network, thereby providing a general method for further understanding deep neural networks.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for quantifying deep neural network interaction information according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a binary tree built from an example sample in a method for quantifying deep neural network interaction information according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims, in view of the black-box nature of current deep neural networks, to provide a method for quantifying the interaction information among input words modeled by a deep neural network, so as to objectively explain the interactions among words that the network models and to facilitate understanding of its internal logic and decision mechanism.
According to an embodiment of the present invention, as shown in fig. 1, there is provided a method for deep neural network interaction information quantization, including steps T1, T2, T3, T4, and T5, each of which is described in detail below.
In step T1, a deep neural network for a natural language processing task in the natural language processing domain is obtained.
In step T2, based on a given input sample, the Shapley value of each unit of the sample is calculated from the output of the deep neural network feature layer; wherein the given input sample is a sentence from a dataset in the natural language processing domain comprising a plurality of units, and each unit of the initial given input sample consists of one word. The output of the feature layer can be the output of any feature layer of the deep neural network or the final output of the network.
The Shapley value method, from cooperative game theory, is a scheme for fairly distributing the payoff obtained by a cooperation among its members according to each member's contribution. In the present invention, given a trained neural network $v$ (i.e., the game), the input sample is $N = \{a_1, a_2, \ldots, a_n\}$, where $n$ is the number of units contained in the input sample (each unit initially consisting of one word) and $a_i$ ($1 \le i \le n$) denotes an individual unit of the input sample (i.e., a player of the game). To obtain greater payoff, some of the players cooperate to form a coalition $S$ (i.e., a set of some of the units); the total payoff obtained by the coalition $S$ in the game is $v(S)$, i.e., the output value of the neural network when the input set of units is $S$. If a player $a_i$ outside the coalition $S$ also joins it, the total payoff finally obtained by the coalition is $v(S \cup \{a_i\})$; then $v(S \cup \{a_i\}) - v(S)$ denotes the marginal contribution of player $a_i$ to the coalition $S$. The Shapley value is the weighted average of each player's marginal contributions to the various possible coalitions $S$ in the game $v$. The Shapley value of the $i$-th unit $a_i$ in the input sample is denoted $\phi_v(a_i)$ and calculated by equation (1):

$$\phi_v(a_i) = \sum_{S \subseteq N \setminus \{a_i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{a_i\}) - v(S)\bigr) \qquad (1)$$

wherein $v$ denotes the neural network, $N$ denotes the set of all units in the current input sample, $|N|$ the size of the set $N$, $S$ any set that can be formed from units other than the $i$-th unit $a_i$, $|S|$ the size of the set $S$, $!$ the factorial, and $v(\cdot)$ the output of the deep neural network. The Shapley value distributes the payoff of the game among its members in proportion to their contributions and thus measures the influence of each unit of the input sample on the final decision of the neural network: the larger the Shapley value, the larger the influence of that unit on the final decision, and vice versa.
Preferably, $\phi_v(a_i)$ is computed by sampling. For example, for a given input sample containing $n$ units, denoted $N = \{a_1, a_2, \ldots, a_n\}$, to compute the Shapley value $\phi_v(a_i)$ of unit $a_i$: sample once from the set $N \setminus \{a_i\}$ formed by the remaining units to obtain a set $S$; mask the word vectors of the units of the input sample not included in $S$ to zero vectors, obtaining a masked sample $S$, and feed it to the deep neural network to obtain $v(S)$; similarly, add unit $a_i$ to the sampled set $S$, mask, and feed the processed sample $S \cup \{a_i\}$ to the neural network to obtain $v(S \cup \{a_i\})$; then $v(S \cup \{a_i\}) - v(S)$ is the marginal contribution of the current unit $a_i$. According to one embodiment of the invention, sampling is performed $M$ times ($M \le 2^{n-1}$), and the average of the $M$ marginal contributions is taken as the Shapley value $\phi_v(a_i)$ of unit $a_i$. The Shapley value of each unit of the input sample is obtained in the same way.
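For illustration, the sampling procedure can be sketched as follows. This is a minimal sketch, not the patented implementation: it assumes a hypothetical wrapper `v` mapping a (masked) list of word vectors to the scalar network output, and the names `shapley_estimate` and `embeddings` are illustrative; masking by zero vectors follows the description above. Drawing a coalition size uniformly and then a uniform subset of that size reproduces the Shapley weights of equation (1), so the plain average of marginal contributions is an unbiased estimate.

```python
import random

def shapley_estimate(v, embeddings, i, num_samples=100):
    """Monte Carlo estimate of the Shapley value of unit i (equation (1)).

    v           -- callable mapping a masked list of word vectors to the
                   scalar output of the deep neural network
    embeddings  -- word vectors of the units of the current sample
    i           -- index of the unit a_i whose Shapley value is estimated
    num_samples -- number of sampled coalitions M
    """
    n = len(embeddings)
    others = [j for j in range(n) if j != i]

    def masked(keep):
        # Units outside the coalition are masked to zero vectors.
        return [e if j in keep else [0.0] * len(e)
                for j, e in enumerate(embeddings)]

    total = 0.0
    for _ in range(num_samples):
        # Draw a coalition S from N \ {a_i}: uniform size, then a uniform
        # subset of that size, which matches the Shapley weighting.
        k = random.randint(0, len(others))
        S = set(random.sample(others, k))
        # Marginal contribution v(S ∪ {a_i}) - v(S).
        total += v(masked(S | {i})) - v(masked(S))
    return total / num_samples
```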
In step T3, the interaction gain ratio between any two adjacent units of the given input sample is calculated. For a given game (in the present invention, the trained deep neural network $v$), several players may form an indivisible whole (in the present invention, a whole $[S]$ formed by some units of the given input sample); that is, the whole $[S]$ is treated as a single player, where $S$ denotes the union of those players (in the present invention the players are units, and $S$ is the set of those units). The interaction gain $B([S])$ contained in the player $[S]$ is then:

$$B([S]) = \phi_{v_{(N\setminus S)\cup\{[S]\}}}([S]) - \sum_{b\in S}\phi_{v_{(N\setminus S)\cup\{b\}}}(b) \qquad (2)$$

wherein $b$ is an element of the set $S$ and $[S]$ is the unit formed by aggregating all units in the set $S$; $v_{(N\setminus S)\cup\{[S]\}}$ denotes the game whose actual players are the members of the set $N$ minus the members of the set $S$ plus the player $[S]$ (in the present invention, the output of the deep neural network $v$ when the input is the units of the given input sample set $N$ minus the units of the set $S$ plus the unit $[S]$); similarly, $v_{(N\setminus S)\cup\{b\}}$ denotes the game whose actual players are the members of $N$ minus the members of $S$ plus the player $b$ (in the present invention, the output of the deep neural network $v$ when the input is the units of $N$ minus the units of $S$ plus the unit $b$). $\phi_{v_{(N\setminus S)\cup\{[S]\}}}([S])$ denotes the Shapley value of the unit $[S]$ under $v_{(N\setminus S)\cup\{[S]\}}$, and likewise $\phi_{v_{(N\setminus S)\cup\{b\}}}(b)$ denotes the Shapley value of unit $b$ under $v_{(N\setminus S)\cup\{b\}}$; for the calculation of Shapley values, refer to equation (1) above.
According to one example of the present invention, assume that the set of all units in an input sample is $N = \{a_1, a_2, \ldots, a_n\}$, and consider a unit $[S] = [\{a_i, a_{i+1}\}]$ formed by aggregating two adjacent units. This unit (containing more than one word) forms, together with the other units remaining in the sample, an aggregated sample $N' = \{a_1, \ldots, a_{i-1}, [S], a_{i+2}, \ldots, a_n\}$. The Shapley value of the unit $[S]$ under the game $v_{N'}$ can then be calculated, and the interaction gain of the unit is obtained according to equation (2):

$$B([S]) = \phi_{v_{(N\setminus S)\cup\{[S]\}}}([S]) - \phi_{v_{(N\setminus S)\cup\{a_i\}}}(a_i) - \phi_{v_{(N\setminus S)\cup\{a_{i+1}\}}}(a_{i+1})$$

noting that here $(N\setminus S)\cup\{[S]\} = N'$.
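A minimal sketch of equation (2), under the assumption of a hypothetical helper `shapley_in_game(v, units, target)` that estimates the Shapley value of `target` when the players of the game are `units` (for instance via the sampling estimator sketched above); units are represented as tuples of word indices:

```python
def interaction_gain_within(shapley_in_game, v, N, S):
    """B([S]) per equation (2): the Shapley value of the merged unit [S]
    in the game (N \\ S) ∪ {[S]}, minus the Shapley value of each b in S
    in the game (N \\ S) ∪ {b}."""
    rest = [u for u in N if u not in S]        # units of N \ S
    merged = tuple(sorted(sum(S, ())))         # [S]: all words of S as one unit
    gain = shapley_in_game(v, rest + [merged], merged)
    for b in S:
        gain -= shapley_in_game(v, rest + [b], b)
    return gain
```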
because the units in the input samples are continuously aggregated, the number of the units contained in the aggregated samples is continuously reduced until the units are finally aggregated into a unit (namely, the whole input sample is taken as a unit). In this polymerization process, if two adjacent units [ S ]1],[S2]Are polymerized into a unit [ S ]]Then, the interactive gain index between the three is the following equation:
B([S])=B([S1])+B([S2])+Bbetween([S1],[S2]) (3)
B([S1]),B([S2]) Are respectively two [ S1]、[S1]Inter-gain within a cell, Bbetween([S1],[S2]) Is the interaction gain between two units, then Bbetween([S1],[S2]) This can be derived as follows:
Bbetween([S1],[S2])=B([S])-B([S1])-B([S2])
finally, the interaction gain B between two adjacent cells can be calculatedbetweenThe ratio r of all the interaction information interacting with these two units. When two units [ S ]1],[S2]Are aggregated into a unit [ S ]]Time, note unit [ S1]The left adjacent unit before being polymerized is [ S ]1]', unit [ S2]The unit adjacent to the right before being polymerized is [ S ]2]', unit [ S1],[S2]The gain of the interaction between is Bbetween([S1],[S2]) Unit [ S ]1]',[S1]The gain of the interaction between is Bbetween([S1]',[S1]) Unit [ S ]2],[S2]' the gain of interaction between Bbetween([S2],[S2]'), and the unit [ S ]1],[S2]The related total interaction information is Bbetween([S1],[S2])、Bbetween([S1]',[S1])、Bbetween([S2],[S2]')、φ([S1])、φ([S2]) Wherein phi ([ S ]1]),φ([S2]) Are respectively a unit [ S1]、[S2]A value of salpril of [ S ]1]、[S2]The interactive gain ratio between the two is:
Figure BDA0002545316300000071
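Equations (3) and (4) then reduce to simple bookkeeping. A minimal sketch, reusing the hypothetical helpers above: `B` is a single-argument callable giving the within-unit interaction gain $B([S])$ for a list of units (e.g. `interaction_gain_within` with the game fixed), `shapley` gives $\phi([S])$, and `left`/`right` are the neighbouring units $[S_1]'$ and $[S_2]'$, or `None` at the sentence border:

```python
def interaction_gain_between(B, S1, S2):
    """B_between([S1],[S2]) = B([S]) - B([S1]) - B([S2]), with S = S1 ∪ S2."""
    return B(S1 + S2) - B(S1) - B(S2)

def interaction_gain_ratio(B, shapley, S1, S2, left, right):
    """Interaction gain between [S1] and [S2] relative to all interaction
    information involving the two units (equation (4))."""
    between = interaction_gain_between(B, S1, S2)
    total = abs(between) + abs(shapley(S1)) + abs(shapley(S2))
    if left is not None:                  # B_between([S1]', [S1])
        total += abs(interaction_gain_between(B, left, S1))
    if right is not None:                 # B_between([S2], [S2]')
        total += abs(interaction_gain_between(B, S2, right))
    return abs(between) / total
```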
in step T4, two adjacent cells with the maximum inter-gain ratio are aggregated to form a new cell, and the new cell and the remaining other cells in the sample together form a once aggregated sample; the steps T2 to T4 are repeated with the aggregated sample as the new given input sample, and the iteration is continued until the aggregated sample contains only one unit, which obviously contains all the words in the initial sample.
For example, according to an example of the present invention, the set of all units in the input sample is $N = \{a_1, a_2, \ldots, a_n\}$; its units can form $(n-1)$ pairs of adjacent units. The two adjacent units with the largest interaction gain ratio, say $a_i$ and $a_{i+1}$, are aggregated into a unit $[\{a_i, a_{i+1}\}]$, forming with the remaining $(n-2)$ units the aggregated sample $N' = \{a_1, \ldots, a_{i-1}, [\{a_i, a_{i+1}\}], a_{i+2}, \ldots, a_n\}$. Obviously, $N'$ contains $n-1$ units. The iteration is repeated, finally yielding a sample $N_{root} = \{[\{a_1, a_2, \ldots, a_n\}]\}$ containing only one unit, whose single unit contains all the words $a_1, a_2, \ldots, a_n$ of the given input sample.
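The iteration of steps T2 to T4 amounts to the following loop (a minimal sketch under the same assumptions as above; `gain_ratio(units, k)` is a hypothetical callable, for instance built from `interaction_gain_ratio`, returning the interaction gain ratio of the adjacent pair `units[k]`, `units[k+1]`):

```python
def aggregate(n, gain_ratio):
    """Merge the adjacent pair with the largest interaction gain ratio
    until a single unit remains; returns the list of merges performed."""
    units = [(i,) for i in range(n)]      # initially one unit per word
    merges = []
    while len(units) > 1:
        # Pick the adjacent pair with the largest interaction gain ratio.
        k = max(range(len(units) - 1),
                key=lambda k: gain_ratio(units, k))
        merged = units[k] + units[k + 1]
        merges.append((units[k], units[k + 1], merged))
        units = units[:k] + [merged] + units[k + 2:]
    return merges
```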
In step T5, a binary tree with a tree-like hierarchy is built according to the process of repeatedly aggregating the units of the input sample. Obviously, the leaf nodes of the tree are the words of the input sample; each time two units are aggregated into one, an internal node of the tree is formed; and as the aggregation proceeds, the root node of the tree is finally formed. A binary tree can thus be built in this bottom-up manner.
According to one example of the invention, a binary tree is constructed for the given input sample {the, sun, is, coming, out}. The sample composed of the units the, sun, is, coming, out is input into a deep neural network trained on a natural language processing dataset, and the units of the sample are aggregated step by step according to the output of the network. In the first aggregation, the units the and sun are aggregated into a new unit the sun, which forms with the remaining units is, coming, out the aggregated sample {the sun, is, coming, out}. In the second aggregation, the units the sun and is are aggregated into a new unit the sun is, which forms with the remaining units coming and out the aggregated sample {the sun is, coming, out}. In the third aggregation, the units coming and out are aggregated into a new unit coming out, which forms with the remaining unit the sun is the aggregated sample {the sun is, coming out}. In the fourth aggregation, the sun is and coming out are aggregated into the unit the sun is coming out, which is the root node of the tree. The binary tree corresponding to the sample {the, sun, is, coming, out} constructed from this aggregation process is shown in fig. 2; displaying the inter-word interaction information modeled by the deep neural network through this binary tree can help in understanding the internal logic of the network.
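Building the binary tree from the recorded merges is then straightforward. A minimal sketch continuing the hypothetical `aggregate` helper above; the `merges` list below is hand-written to reproduce the aggregation order of this example (fig. 2), not computed:

```python
def build_tree(words, merges):
    """Bottom-up construction: leaves are the words; every merge creates
    the parent node of the two units it joined; the last merge is the root."""
    nodes = {(i,): w for i, w in enumerate(words)}   # leaf nodes
    for left, right, merged in merges:
        nodes[merged] = (nodes[left], nodes[right])  # internal node
    return nodes[merged]

words = ["the", "sun", "is", "coming", "out"]
merges = [((0,), (1,), (0, 1)),                   # the + sun
          ((0, 1), (2,), (0, 1, 2)),              # the sun + is
          ((3,), (4,), (3, 4)),                   # coming + out
          ((0, 1, 2), (3, 4), (0, 1, 2, 3, 4))]   # root
print(build_tree(words, merges))
# ((('the', 'sun'), 'is'), ('coming', 'out'))
```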
The method provided by the invention explains the internal logic of a neural network through a hierarchical structure: it objectively quantifies the interaction information among input-sample words modeled in the deep neural network, clusters adjacent units with significant interaction according to the interaction gain ratio, and finally obtains a tree-shaped hierarchical structure reflecting the inter-word interaction information modeled in the deep neural network, providing a general method for further understanding deep neural networks. The method can construct such a tree diagram for any deep neural network used for a natural language processing task, so as to understand the logic inherent in the network.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method for quantifying deep neural network interaction information, used for constructing a tree diagram that quantifies the inter-word interaction information modeled by a deep neural network in a natural language processing task, characterized by comprising the following steps:
s1, obtaining a sample from the natural language processing field data set, wherein the sample comprises a plurality of units, each unit corresponds to a word, and the units in the sample are subjected to multiple aggregation processing until the units in the sample are aggregated into a unit;
wherein each aggregation step comprises:
inputting the current sample into a deep neural network and calculating the Shapley value of each unit in the current sample from the output of the network, wherein the deep neural network is used for a natural language processing task in the field of natural language processing;
calculating the interaction gain ratio between every two adjacent units based on the Shapley value of each unit, aggregating the two adjacent units with the largest interaction gain ratio into a new unit, and forming, together with the other units of the current sample, a new current sample for the next aggregation step;
s2, constructing a tree diagram reflecting the inter-word interaction information of the deep neural network internal modeling according to the unit aggregation mode in the multiple aggregation processing process of the given sample in the step S1.
2. The method for quantifying deep neural network interaction information of claim 1, wherein the Shapley value of each unit in the current sample is the weighted average of the marginal contributions of that unit to the sets that can be formed from all other units in the current sample.
3. The method for quantifying deep neural network interaction information of claim 2, wherein the Shapley value of each unit in the current sample is determined by:

$$\phi_v(a_i) = \sum_{S \subseteq N \setminus \{a_i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{a_i\}) - v(S)\bigr)$$

wherein $v$ denotes the neural network, $\phi_v(a_i)$ denotes the Shapley value of the $i$-th unit $a_i$ in the current sample, $N$ denotes the set of all units in the current sample, $|N|$ the size of the set $N$, $S$ any set that can be formed from units other than the $i$-th unit $a_i$, $|S|$ the size of the set $S$, $!$ the factorial, and $v(\cdot)$ the output of the deep neural network; $v(S \cup \{a_i\}) - v(S)$ denotes the marginal contribution of the $i$-th unit $a_i$ to the set $S$, where $v(S \cup \{a_i\})$ denotes the output obtained by adding the $i$-th unit $a_i$ to the set $S$ and feeding the result to the neural network, and $v(S)$ denotes the output obtained by feeding the set $S$ to the neural network.
4. The method of claim 3, wherein the interaction gain ratio between two adjacent units is the ratio of the interaction gain of the two adjacent units to all the interaction information involving the two units.
5. The method of claim 4, wherein the interaction gain ratio between two adjacent units is determined by:
$$r([S_1],[S_2]) = \frac{\left|B_{between}([S_1],[S_2])\right|}{\left|B_{between}([S_1]',[S_1])\right| + \left|B_{between}([S_1],[S_2])\right| + \left|B_{between}([S_2],[S_2]')\right| + \left|\phi([S_1])\right| + \left|\phi([S_2])\right|}$$

wherein $[S_1]$ denotes the unit formed by aggregating all units in the set $S_1$, $[S_2]$ denotes the unit formed by aggregating all units in the set $S_2$, $[S_1]$ and $[S_2]$ being two adjacent units; $B_{between}([S_1],[S_2])$ is the interaction gain between the two adjacent units $[S_1]$ and $[S_2]$; $[S_1]'$ is the unit adjacent to $[S_1]$ on its left before the aggregation, and $[S_2]'$ is the unit adjacent to $[S_2]$ on its right before the aggregation; $B_{between}([S_1]',[S_1])$ is the interaction gain between units $[S_1]'$ and $[S_1]$, and $B_{between}([S_2],[S_2]')$ is the interaction gain between units $[S_2]$ and $[S_2]'$; the total interaction information related to the units $[S_1]$ and $[S_2]$ consists of $B_{between}([S_1],[S_2])$, $B_{between}([S_1]',[S_1])$, $B_{between}([S_2],[S_2]')$, $\phi([S_1])$ and $\phi([S_2])$, where $\phi([S_1])$ and $\phi([S_2])$ are the Shapley values of units $[S_1]$ and $[S_2]$ respectively.
6. The method of claim 5, wherein the interaction gain between two adjacent units is the difference between the interaction gain within the new unit formed by aggregating the two adjacent units and the interaction gains within the two adjacent units before aggregation.
7. The method of claim 6, wherein the interaction gain between two adjacent units is determined by:
$$B_{between}([S_1],[S_2]) = B([S]) - B([S_1]) - B([S_2])$$

wherein $[S]$ denotes the unit formed by aggregating all units in the set $S$, $[S_1]$ denotes the unit formed by aggregating all units in the set $S_1$, $[S_2]$ denotes the unit formed by aggregating all units in the set $S_2$, $B(\cdot)$ denotes the interaction gain within a unit, and $B_{between}(\cdot)$ denotes the interaction gain between units.
8. The method of claim 7, wherein the interaction gain within each unit is determined by:
$$B([S]) = \phi_{v_{(N\setminus S)\cup\{[S]\}}}([S]) - \sum_{b\in S}\phi_{v_{(N\setminus S)\cup\{b\}}}(b)$$

wherein $[S]$ denotes the unit formed by aggregating all units in the set $S$, $b$ is a unit in the set $S$, and $N \setminus S$ denotes the set formed by the units in the set $N$ excluding those in the set $S$.
9. The method for quantifying deep neural network interaction information according to any one of claims 1 to 8, wherein in step S2 a binary tree is constructed as follows:
s31, forming a first layer leaf node of the binary tree from bottom to top by using all units in the sample;
and S32, following the aggregation order, taking the new unit formed by each aggregation as the parent node of the two adjacent units that were merged, until the root node of the tree is formed.
10. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of any one of claims 1 to 9.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method according to any one of claims 1 to 9.
CN202010558767.1A 2020-06-18 2020-06-18 Method for quantizing interactive information of deep neural network Active CN111737466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010558767.1A CN111737466B (en) 2020-06-18 2020-06-18 Method for quantizing interactive information of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010558767.1A CN111737466B (en) 2020-06-18 2020-06-18 Method for quantizing interactive information of deep neural network

Publications (2)

Publication Number Publication Date
CN111737466A 2020-10-02
CN111737466B CN111737466B (en) 2022-11-29

Family

ID=72649650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010558767.1A Active CN111737466B (en) 2020-06-18 2020-06-18 Method for quantizing interactive information of deep neural network

Country Status (1)

Country Link
CN (1) CN111737466B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180253646A1 (en) * 2017-03-05 2018-09-06 International Business Machines Corporation Hybrid aggregation for deep learning neural networks
CN108875024A (en) * 2018-06-20 2018-11-23 Graduate School at Shenzhen, Tsinghua University Text classification method, system, readable storage medium and electronic device
CN109299262A (en) * 2018-10-09 2019-02-01 Sun Yat-sen University Text entailment relation recognition method fusing multi-granularity information
CN109858032A (en) * 2019-02-14 2019-06-07 Cheng Shuyu Multi-granularity sentence-interaction natural language inference model incorporating the attention mechanism
CN110866113A (en) * 2019-09-30 2020-03-06 Zhejiang University Text classification method based on fine-tuning a BERT model with a sparse self-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG XIAOQING et al.: "Utility Allocation Strategy for Virtualized Resources Based on Cooperative Game", Computer Science *
CHENG SHUYU et al.: "Research on Natural Language Inference with Attention-Fused Multi-Granularity Sentence Interaction", Journal of Chinese Computer Systems *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200320A (en) * 2020-12-02 2021-01-08 成都数联铭品科技有限公司 Model interpretation method, system, equipment and storage medium based on cooperative game method
CN118378667A (en) * 2024-04-01 2024-07-23 佛山科学技术学院 NAS neural network design method and system based on saprolil values

Also Published As

Publication number Publication date
CN111737466B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN110807154B (en) Recommendation method and system based on hybrid deep learning model
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
Lall et al. The MIDAS touch: accurate and scalable missing-data imputation with deep learning
CN111078836B (en) Machine reading understanding method, system and device based on external knowledge enhancement
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
CN110674850A (en) Image description generation method based on attention mechanism
CN111309927B (en) Personalized learning path recommendation method and system based on knowledge graph mining
CN110046228B (en) Short text topic identification method and system
CN111125520B (en) Event line extraction method based on deep clustering model for news text
Evans Uncertainty and error
CN111737466B (en) Method for quantizing interactive information of deep neural network
CN112417289A (en) Information intelligent recommendation method based on deep clustering
WO2023045725A1 (en) Method for dataset creation, electronic device, and computer program product
Clarke Logical constraints: The limitations of QCA in social science research
Christensen et al. Factor or network model? Predictions from neural networks
CN117313160B (en) Privacy-enhanced structured data simulation generation method and system
Roussos Normative formal Epistemology as modelling
Shin et al. End-to-end task dependent recurrent entity network for goal-oriented dialog learning
CN109977194B (en) Text similarity calculation method, system, device and medium based on unsupervised learning
Yang Machine learning methods on COVID-19 situation prediction
Zhu et al. A hybrid model for nonlinear regression with missing data using quasilinear kernel
Wang et al. [Retracted] Application of Improved Machine Learning and Fuzzy Algorithm in Educational Information Technology
CN112507185B (en) User portrait determination method and device
CN110348577B (en) Knowledge tracking method based on fusion cognitive computation
CN112463964A (en) Text classification and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant