CN110442680A

CN110442680A - Vision-based embedded vector generation method for ideographs

Info

Publication number: CN110442680A
Application number: CN201910717710.9A
Authority: CN
Inventors: 刘斌
Original assignee: Southwestern University Of Finance And Economics
Current assignee: Southwestern University Of Finance And Economics
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2019-11-12

Abstract

The invention discloses a vision-based embedded vector generation method for ideographs. The method comprises: generating corresponding mask pictures from text content according to a recognition unit; generating a black background picture corresponding to each mask picture; superimposing the mask picture on the background picture to synthesize a single-channel grayscale picture based on the text content; and extracting the gray matrix corresponding to a single character as the coding vector of that character. The proposed method can simplify the natural language processing pipeline and significantly improve the efficiency with which computers process text content.

Description

Vision-based embedded vector generation method for ideographs

Technical field

The invention belongs to the technical field of natural language processing, and in particular relates to a vision-based embedded vector generation method for ideographs.

Background art

In the field of natural language processing (Natural Language Processing), for a computer to understand the meaning of a word or character, the word or character usually needs to be encoded, that is, uniformly converted into a corresponding real-valued vector, with each word or character corresponding one-to-one to a real-valued vector. After encoding, natural language processing algorithms can operate directly on the vectors in order to understand the words or characters. Word vectors are currently widely used in text classification, text sentiment analysis, machine reading, machine translation and other fields.

The two most common word vector encoding schemes at present are one-hot encoding based on a dictionary or character set, and the word2vec methods based on distributed word vector models (such as skip-gram, CBOW, GloVe, etc.).

One-hot encoding constructs, according to the dictionary length, a corresponding and unique high-dimensional binary vector for each word; exactly one component of this binary vector equals 1, and all other components are 0. Suppose the number of words counted in the corpus is 10000 (low-frequency words are usually removed first). The dictionary length is then 10000, the word vector of the first word in the dictionary is (1, 0, 0, ..., 0), the word vector of the second word is (0, 1, 0, ..., 0), and the word vector of the i-th word is (0, ..., 0, 1, 0, ..., 0), i.e., the i-th component is 1 and all other components are 0.
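The one-hot construction described above can be illustrated with a minimal Python sketch. The function name `one_hot`, the 0-based indexing and the four-word toy dictionary are assumptions made for illustration only; they do not come from the patent.

```python
def one_hot(index, dictionary_length):
    """Return the one-hot vector for the word at position `index` (0-based)."""
    if not 0 <= index < dictionary_length:
        raise ValueError("index lies outside the dictionary")
    vector = [0] * dictionary_length
    vector[index] = 1  # exactly one component is 1, all others stay 0
    return vector


# Toy dictionary; the entries are placeholders, not words from the patent.
dictionary = ["word_a", "word_b", "word_c", "word_d"]
word_to_index = {word: i for i, word in enumerate(dictionary)}

first = one_hot(word_to_index["word_a"], len(dictionary))   # [1, 0, 0, 0]
second = one_hot(word_to_index["word_b"], len(dictionary))  # [0, 1, 0, 0]
```

With a dictionary of 10000 words, the same function would return a vector of length 10000 for every word, which is the dimensionality cost discussed in this section.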

A distributed word vector represents each word in the dictionary with a low-dimensional, continuous real-valued vector. An important idea behind distributed word vectors is word2vec: a random vector (for example, 100-dimensional) is first initialized for each word, the word vectors are then continuously updated according to the relative positions in which words occur in the sentences of the corpus, and a unique distributed representation is finally obtained for each word. The benefit of this method is that the distances between word vectors approximately reflect the semantic relationships between words.

These methods are convenient when applied to text analysis tasks for phonographic languages such as English. However, in text analysis tasks based on ideographs such as Chinese, they have significant limitations. Taking Chinese as an example, the specific shortcomings of existing word vector schemes are described below.

First, if a traditional word vector generation scheme is used to encode Chinese, whether one-hot encoding or distributed encoding, the corpus must be cleaned. Corpus cleaning includes removing low-frequency words, removing non-Chinese characters (including punctuation marks and the text and characters of other languages), and so on.

Second, the smallest morpheme unit of Chinese is the character, so both one-hot word vectors and distributed word vectors require Chinese word segmentation of the corpus. Chinese word segmentation recombines a continuous character sequence into a sequence of Chinese words according to certain rules. In other words, traditional word vectors need a separate segmentation algorithm to preprocess the corpus.

Third, after corpus cleaning and word segmentation, both one-hot encoding and distributed encoding still need to maintain a dictionary. For one-hot encoding, the length of the word vector equals the length of the dictionary, so choosing a reasonable dictionary length is a troublesome problem: if the dimension is too high, more of the words in the corpus can be retained, but the word vectors become longer; conversely, if the dimension is reduced, the number of words maintained in the dictionary must be reduced. As for distributed encoding, although its word vector dimension is lower than that of the one-hot scheme, the quality of the resulting word vectors depends heavily on the quality of the corpus.

Summary of the invention

The main purpose of the present invention is to provide a vision-based embedded vector generation method for ideographs, aiming to solve the above technical problems of existing methods.

To achieve the above purpose, the present invention provides a vision-based embedded vector generation method for ideographs, comprising the following steps:

S1, dividing the text content according to a set recognition unit, and generating a corresponding mask picture for each recognition unit in turn;

S2, generating a black background picture no smaller than the mask picture in step S1;

S3, taking the mask picture in step S1 as the foreground, superimposing the mask picture on the background picture in step S2, and synthesizing a single-channel grayscale picture based on the text content;

S4, extracting the gray matrix corresponding to a single character from the single-channel grayscale picture based on the text content obtained in step S3, and taking the gray matrix as the coding vector of the corresponding character.

Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:

setting a single character in the text as the recognition unit, and generating the corresponding mask pictures from the text content one by one, character by character.

Further, the single character is specifically a Chinese character, a punctuation mark, a space or another special character.

Alternatively, a sentence in the text is set as the recognition unit, and the corresponding mask pictures are generated from the text content one by one, sentence by sentence.

Alternatively, a paragraph in the text is set as the recognition unit, and the corresponding mask pictures are generated from the text content one by one, paragraph by paragraph.

Alternatively, an article in the text is set as the recognition unit, and the corresponding mask pictures are generated from the text content one by one, article by article.

Further, step S4 further comprises:

binarizing the single-channel grayscale picture based on the text content obtained in step S3, extracting the binary matrix corresponding to a single character, and taking the binary matrix as the coding vector of the corresponding character.

The invention has the following advantages:

(1) When the coding vectors generated by the present invention are input into a deep learning model, performance on Chinese natural language processing tasks such as text classification, text sentiment analysis and text understanding improves significantly compared with models based on traditional word vectors;

(2) The present invention does not require preprocessing such as corpus cleaning or Chinese word segmentation, that is, it eliminates all the preprocessing problems of traditional word vector generation schemes;

(3) The dimension of the coding vectors generated by the present invention (usually several hundred dimensions; for example, each character in Fig. 1 has 728 dimensions) lies between that of one-hot encoding (usually thousands or even tens of thousands of dimensions, depending on the length of the dictionary it maintains) and that of word2vec (usually around 100 dimensions), that is, the vectors are sparse while remaining relatively low-dimensional;

(4) The present invention does not need to maintain an additional dictionary, that is, any Chinese character or other character can be processed as a picture.

Brief description of the drawings

Fig. 1 is a flow diagram of the vision-based embedded vector generation method for ideographs of the present invention;

Fig. 2 is a schematic diagram of the mask picture and background picture generated in the present invention;

Fig. 3 is a schematic diagram of the coding vector of a character generated in the present invention.

Specific embodiment

In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

The main solution of the embodiments of the present invention is:

As shown in Fig. 1, a vision-based embedded vector generation method for ideographs comprises steps S1 to S4 set out above, each of which is described below.

Taking Chinese characters as an example, and starting from the perspective of simulating how humans naturally read Chinese, the present invention provides computers with a scheme for encoding Chinese text based on visual signals, thereby improving the efficiency with which computers process Chinese text.

In an optional embodiment of the present invention, step S1 uses the Chinese text content to generate the corresponding mask pictures. In generating the mask pictures, a single character in the text is set as the recognition unit, and mask pictures are generated from the text content one by one, character by character; here a single character is specifically a Chinese character, a punctuation mark, a space, another special character, and so on.

Preferably, the present invention may also set a sentence in the text as the recognition unit and generate the corresponding mask pictures from the text content one by one, sentence by sentence.

Preferably, the present invention may also set a paragraph in the text as the recognition unit and generate the corresponding mask pictures from the text content one by one, paragraph by paragraph.

Preferably, the present invention may also set an entire article as the recognition unit and generate the corresponding mask pictures from the text content one by one, article by article.
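The division of text into recognition units in step S1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_units`, the choice of 。！？ as sentence terminators, the use of line breaks as paragraph boundaries and the sample text are all hypothetical, since the patent does not define how sentence or paragraph boundaries are detected.

```python
import re


def split_units(text, unit="character"):
    """Split text into recognition units: character, sentence, paragraph or article."""
    if unit == "character":
        # Chinese characters, punctuation and spaces all count as units;
        # dropping line breaks is a choice made by this sketch.
        return [ch for ch in text if ch != "\n"]
    if unit == "sentence":
        # Assumption: a sentence ends at one of the terminators 。！？
        parts = re.findall(r"[^。！？\n]+[。！？]?", text)
        return [part for part in parts if part.strip()]
    if unit == "paragraph":
        # Assumption: paragraphs are separated by line breaks.
        return [part for part in text.split("\n") if part.strip()]
    if unit == "article":
        return [text]
    raise ValueError(f"unknown recognition unit: {unit}")


# Placeholder sample text, not taken from the patent.
sample = "今天天气好。我们去公园！\n明天下雨。"
characters = split_units(sample, "character")
sentences = split_units(sample, "sentence")
paragraphs = split_units(sample, "paragraph")
```

Each returned unit would then be rendered into its own mask picture; the rendering itself (rasterizing glyphs with a font) is outside the scope of this sketch.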

As shown in Fig. 2, for a Chinese article, each Chinese character and the other characters such as punctuation marks are processed one by one from front to back, following human reading habits. Suppose the current loop has reached the Chinese character "certainly": the white mask corresponding to "certainly" is generated first. As shown in Fig. 2, the size of the generated mask picture is set to 25*27.

In an optional embodiment of the present invention, step S2 generates a black background picture whose size is greater than or equal to that of the mask picture. As shown in Fig. 2, the mask of the Chinese character "certainly" corresponds to a black background, and the size of the black background picture is set to 26*28.

In steps S1 and S2, the sizes of the generated mask picture and background picture can be adjusted according to actual needs, provided that the background picture is no smaller than the mask picture, so that subsequent steps can process the synthesized picture.
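The size constraint between the mask picture and the background picture can be sketched in Python as below. The helper name `make_background` and the nested-list picture representation are assumptions for illustration. The text gives the sizes as 25*27 and 26*28 without saying which number is the height, so treating them as rows × columns is also an assumption.

```python
def make_background(mask_rows, mask_cols, rows=None, cols=None):
    """Create an all-black (gray level 0) background no smaller than the mask."""
    rows = mask_rows if rows is None else rows
    cols = mask_cols if cols is None else cols
    if rows < mask_rows or cols < mask_cols:
        raise ValueError("background must not be smaller than the mask")
    return [[0] * cols for _ in range(rows)]


# Sizes from the embodiment: a 25*27 mask on a 26*28 background.
background = make_background(25, 27, rows=26, cols=28)
```

Calling `make_background(25, 27)` without explicit sizes yields a background exactly as large as the mask, which is the smallest size the constraint allows.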

In an optional embodiment of the present invention, step S3 takes the mask picture in step S1 as the foreground and superimposes the foreground mask picture on the background picture in step S2, synthesizing a single-channel grayscale picture based on the text content. As shown in Fig. 2, the grayscale picture corresponding to the Chinese character "certainly" is a 26*28 picture.
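A minimal sketch of the superposition in step S3 follows. The function name `superimpose`, the `top` and `left` offsets and the small toy mask are hypothetical; the patent does not state where on the background the mask is placed, so anchoring it at the top-left corner by default is an assumption of this sketch.

```python
def superimpose(mask, background, top=0, left=0):
    """Paste a grayscale mask (foreground) onto a copy of the background."""
    rows, cols = len(mask), len(mask[0])
    fits = (
        top >= 0
        and left >= 0
        and top + rows <= len(background)
        and left + cols <= len(background[0])
    )
    if not fits:
        raise ValueError("mask does not fit inside the background")
    picture = [row[:] for row in background]  # leave the input untouched
    for r in range(rows):
        for c in range(cols):
            picture[top + r][left + c] = mask[r][c]
    return picture


# Toy 3*3 mask (255 = white stroke pixel) on a 4*5 black background.
mask = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
background = [[0] * 5 for _ in range(4)]
picture = superimpose(mask, background)
```

The result is a single-channel picture: one gray level per pixel, with the white strokes of the mask standing out against the black background.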

In an optional embodiment of the present invention, step S4, for the mask of a single character in step S1, extracts the gray matrix corresponding to the single character from the single-channel grayscale picture based on the text content obtained in step S3, and takes the gray matrix as the coding vector of the corresponding character. For the mask of a sentence, a paragraph or an article in step S1, the gray matrix (picture) obtained in step S3 is taken as the encoding of that sentence, paragraph or article.
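How a gray matrix might serve as a coding vector can be sketched as below. Flattening the matrix row by row is an assumption of this sketch; the patent only states that the gray matrix is used as the coding vector and quotes 728 dimensions for a 26*28 picture. The function name `to_coding_vector` and the placeholder pixel values are likewise hypothetical.

```python
def to_coding_vector(gray_matrix):
    """Flatten a 2-D gray matrix row by row into a 1-D coding vector."""
    return [pixel for row in gray_matrix for pixel in row]


# Stand-in for the 26*28 grayscale picture of one character; the pixel
# values are placeholders, not a real glyph.
gray_matrix = [[0] * 28 for _ in range(26)]
gray_matrix[10][5] = 255  # one white pixel so the matrix is not all black

vector = to_coding_vector(gray_matrix)
print(len(vector))  # 728, i.e. 26 * 28
```

Because most pixels of a glyph picture are black, such a vector is largely zeros, which is consistent with the sparsity the description attributes to the generated coding vectors.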

Preferably, the present invention may also binarize the single-channel grayscale picture based on the text content obtained in step S3, extract the binary matrix corresponding to a single character, and take the binary matrix as the coding vector of the corresponding character. As shown in Fig. 3, the encoding of the Chinese character "certainly" is a 26*28 binary matrix.
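The binarization step can be sketched with a simple fixed threshold. The threshold value of 128, the function name `binarize` and the toy gray levels are assumptions; the patent does not specify which binarization rule is used.

```python
def binarize(gray_matrix, threshold=128):
    """Map each gray level to 1 (at or above the threshold) or 0 (below it)."""
    return [
        [1 if pixel >= threshold else 0 for pixel in row]
        for row in gray_matrix
    ]


# Toy gray levels in the range 0..255, not taken from a real glyph.
gray = [
    [0, 40, 200],
    [130, 255, 90],
]
binary = binarize(gray)  # [[0, 0, 1], [1, 1, 0]]
```

Applied to a 26*28 grayscale picture, the same function would return a 26*28 matrix of zeros and ones, matching the shape of the encoding described for Fig. 3.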

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the present invention, and it should be understood that the protection scope of the present invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed by the present invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the present invention, and these variations and combinations still fall within the protection scope of the present invention.

Claims

1. A vision-based embedded vector generation method for ideographs, characterized by comprising the following steps:

S1, dividing the text content according to a set recognition unit, and generating a corresponding mask picture for each recognition unit in turn;

S2, generating a black background picture no smaller than the mask picture in step S1;

S3, taking the mask picture in step S1 as the foreground, superimposing the mask picture on the background picture in step S2, and synthesizing a single-channel grayscale picture based on the text content;

S4, extracting the gray matrix corresponding to a single character from the single-channel grayscale picture based on the text content obtained in step S3, and taking the gray matrix as the coding vector of the corresponding character.

2. The vision-based embedded vector generation method for ideographs according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:

setting a single character in the text as the recognition unit, and generating the corresponding mask pictures from the text content one by one, character by character.

3. The vision-based embedded vector generation method for ideographs according to claim 2, characterized in that the single character is specifically a Chinese character, a punctuation mark, a space or another special character.

4. The vision-based embedded vector generation method for ideographs according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:

setting a sentence in the text as the recognition unit, and generating the corresponding mask pictures from the text content one by one, sentence by sentence.

5. The vision-based embedded vector generation method for ideographs according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:

setting a paragraph in the text as the recognition unit, and generating the corresponding mask pictures from the text content one by one, paragraph by paragraph.

6. The vision-based embedded vector generation method for ideographs according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:

setting an article in the text as the recognition unit, and generating the corresponding mask pictures from the text content one by one, article by article.

7. The vision-based embedded vector generation method for ideographs according to any one of claims 1 to 6, characterized in that step S4 further comprises:

binarizing the single-channel grayscale picture based on the text content obtained in step S3, extracting the binary matrix corresponding to a single character, and taking the binary matrix as the coding vector of the corresponding character.