CN110442680A - Vision-based embedding vector generation method for ideographic characters - Google Patents

Vision-based embedding vector generation method for ideographic characters

Info

Publication number
CN110442680A
Authority
CN
China
Prior art keywords
text
picture
recognition unit
content
ideograph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910717710.9A
Other languages
Chinese (zh)
Inventor
刘斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwestern University Of Finance And Economics
Original Assignee
Southwestern University Of Finance And Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwestern University Of Finance And Economics
Priority to CN201910717710.9A
Publication of CN110442680A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a vision-based embedding vector generation method for ideographic characters. The method comprises: generating mask pictures from the text content according to a recognition unit; generating a black background picture corresponding to each mask picture; superimposing the mask picture on the background picture to synthesize a single-channel grayscale picture based on the text content; and extracting the grayscale matrix corresponding to a single character as the coding vector of that character. The proposed method can simplify the natural language processing pipeline and significantly improve the efficiency with which a computer processes text content.

Description

Vision-based embedding vector generation method for ideographic characters
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a vision-based embedding vector generation method for ideographic characters.
Background
In natural language processing (NLP), for a computer to understand the meaning of a word or character, the word or character usually has to be encoded, that is, converted into a real-valued vector, with each word or character corresponding to exactly one vector. After encoding, natural language processing algorithms can operate directly on the vectors in order to understand the words or characters. Word vectors are now widely used in text classification, sentiment analysis, machine reading, machine translation, and other fields.
The two most common word vector encoding schemes at present are one-hot encoding, based on a dictionary or character set, and methods based on distributed word vector models, such as word2vec (skip-gram, CBOW) and GloVe.
One-hot encoding constructs, for each word, a unique high-dimensional binary vector whose length equals the dictionary length; exactly one component of the vector is 1 and all others are 0. Suppose the corpus contains 10,000 distinct words (low-frequency words are usually removed first). The dictionary length is then 10,000; the word vector of the first word in the dictionary is (1, 0, 0, ..., 0), that of the second word is (0, 1, 0, ..., 0), and that of the i-th word is (0, ..., 0, 1, 0, ..., 0), where the i-th component is 1 and all other components are 0.
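The one-hot scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names `build_vocab` and `one_hot`, the `min_count` parameter, and the tiny English corpus are assumptions introduced here and do not come from the patent.

```python
def build_vocab(tokens, min_count=1):
    """Map each sufficiently frequent token to a dictionary index."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    # Keep first-seen order so that the indices are deterministic.
    kept = [tok for tok in counts if counts[tok] >= min_count]
    return {tok: i for i, tok in enumerate(kept)}


def one_hot(token, vocab):
    """Return a binary vector of dictionary length with a single 1."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Hypothetical toy corpus; a real dictionary would hold thousands of words.
corpus = ["cat", "dog", "cat", "bird", "dog", "cat"]
vocab = build_vocab(corpus)
print(one_hot("dog", vocab))  # [0, 1, 0]
```

The vector length grows with the dictionary, which is the maintenance burden the background section goes on to criticize.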
A distributed word vector represents each word in the dictionary with a low-dimensional, continuous real-valued vector. One important realization of this idea is word2vec: each word is first assigned a randomly initialized vector (of, for example, 100 dimensions), the vectors are then updated continuously according to the relative positions in which words occur in the sentences of the corpus, and the result is a distributed representation unique to each word. The benefit of this approach is that the distances between word vectors approximately reflect the semantic relations between words.
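To make the remark about distances concrete, the sketch below compares hand-picked vectors with cosine similarity, one common way to measure closeness between word vectors. The 4-dimensional vectors are hypothetical values chosen for illustration; they were not produced by word2vec or any other trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-picked vectors (not trained embeddings).
vectors = {
    "king": [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.1, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.4],
}

sim_related = cosine_similarity(vectors["king"], vectors["queen"])
sim_unrelated = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_related > sim_unrelated)  # True for these toy values
```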
These methods are convenient when applied to text analysis tasks in phonographic languages such as English. In text analysis tasks for ideographic languages such as Chinese, however, they have significant limitations. The shortcomings of existing word vector schemes are described below, taking Chinese as an example.
First, if a traditional word vector scheme is used to encode Chinese, the corpus must be cleaned, whether one-hot or distributed encoding is used. Corpus cleaning includes removing low-frequency words and removing non-Chinese characters (punctuation marks and the text and characters of other languages);
Second, the smallest morpheme unit of Chinese is the character, so both one-hot and distributed word vectors require Chinese word segmentation of the corpus. Chinese word segmentation recombines a continuous sequence of characters into a sequence of words according to a given specification. In other words, traditional word vectors need a separate segmentation algorithm to preprocess the corpus.
Third, after corpus cleaning and word segmentation, both one-hot and distributed encoding still require maintaining a dictionary. For one-hot encoding, the length of the word vector equals the length of the dictionary, so choosing a reasonable dictionary length is troublesome: a high dimension retains more of the words in the corpus but lengthens the word vectors, whereas a lower dimension necessarily reduces the number of words kept in the dictionary. Distributed encoding yields word vectors of lower dimension than one-hot encoding, but the quality of the resulting vectors depends heavily on the quality of the corpus.
Summary of the invention
The main purpose of the present invention is to provide a vision-based embedding vector generation method for ideographic characters, intended to solve the above technical problems of existing methods.
To achieve this purpose, the present invention provides a vision-based embedding vector generation method for ideographic characters, comprising the following steps:
S1, dividing the text content according to a set recognition unit, and generating a corresponding mask picture for each recognition unit in turn;
S2, generating a black background picture whose size is not smaller than that of the mask picture in step S1;
S3, taking the mask picture of step S1 as the foreground and superimposing it on the background picture of step S2 to synthesize a single-channel grayscale picture based on the text content;
S4, extracting the grayscale matrix corresponding to a single character from the single-channel grayscale picture obtained in step S3, and using the grayscale matrix as the coding vector of that character.
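Steps S1 to S4 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hand-drawn 5x5 bitmaps in `TOY_GLYPHS` stand in for a real font rasterizer, the mask is pasted at the top-left corner of the background, the 6x6 background size is arbitrary, and all function names are hypothetical.

```python
# Hypothetical 5x5 bitmaps standing in for a real font rasterizer.
TOY_GLYPHS = {
    "十": ["..#..", "..#..", "#####", "..#..", "..#.."],
    "口": ["#####", "#...#", "#...#", "#...#", "#####"],
}


def make_mask(char):
    """S1: white (255) strokes on an otherwise empty (0) mask."""
    return [[255 if c == "#" else 0 for c in row] for row in TOY_GLYPHS[char]]


def make_background(height, width):
    """S2: an all-black picture."""
    return [[0] * width for _ in range(height)]


def composite(mask, background):
    """S3: paste the mask onto the background (top-left is an assumption)."""
    if len(mask) > len(background) or len(mask[0]) > len(background[0]):
        raise ValueError("background must not be smaller than the mask")
    picture = [row[:] for row in background]
    for r, mask_row in enumerate(mask):
        for c, value in enumerate(mask_row):
            picture[r][c] = max(picture[r][c], value)
    return picture


def encode_char(char, height=6, width=6):
    """S4: take the grayscale matrix as the code, flattened to a vector."""
    gray = composite(make_mask(char), make_background(height, width))
    return [v for row in gray for v in row]


text = "十口"
codes = [encode_char(ch) for ch in text]
print(len(codes), len(codes[0]))  # 2 36
```

No dictionary lookup or word segmentation is involved: any character that can be drawn can be encoded, which is the property the method relies on.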
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the single character as the recognition unit, and generating the corresponding mask pictures from the text content one character at a time.
Further, the single character is specifically a Chinese character, a punctuation mark, a space, or another special character.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the sentence as the recognition unit, and generating the corresponding mask pictures from the text content one sentence at a time.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the paragraph as the recognition unit, and generating the corresponding mask pictures from the text content one paragraph at a time.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the article as the recognition unit, and generating the corresponding mask pictures from the text content one article at a time.
Further, step S4 further comprises:
binarizing the single-channel grayscale picture based on the text content obtained in step S3, extracting the binary matrix corresponding to a single character, and using the binary matrix as the coding vector of that character.
The invention has the following beneficial effects:
(1) when the coding vectors generated by the present invention are fed into a deep learning model, performance on Chinese natural language processing tasks such as text classification, sentiment analysis, and text understanding improves significantly compared with models based on traditional word vectors;
(2) the present invention requires no preprocessing such as corpus cleaning or Chinese word segmentation, which eliminates all the preprocessing problems of traditional word vector generation schemes;
(3) the dimension of the coding vectors generated by the present invention (usually several hundred; for example, each character as shown in Fig. 1 has 728 dimensions) lies between that of one-hot encoding (usually thousands or even tens of thousands, tied to the length of the dictionary it maintains) and that of word2vec (usually about 100), so the vectors are sparse while remaining relatively low-dimensional;
(4) the present invention does not require maintaining an additional dictionary, since any Chinese character or other character can be treated as a picture.
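The 728-dimension figure in item (3) matches the 26*28 picture size used in the embodiment, since 26 * 28 = 728. The short sketch below flattens such a matrix and places the resulting dimension next to the other two schemes; which of 26 and 28 is the row count is an assumption, and the values 10000 and 100 are the illustrative figures quoted in the background section.

```python
rows, cols = 26, 28  # picture size from the embodiment; row/column order assumed

# An all-black placeholder picture; a real one would contain glyph strokes.
gray = [[0] * cols for _ in range(rows)]

# Row-major flattening turns the matrix into a single coding vector.
vector = [value for row in gray for value in row]

dimensions = {
    "one-hot (dictionary of 10000 words)": 10000,
    "visual coding vector (26*28)": len(vector),
    "word2vec (typical)": 100,
}
for name, dim in dimensions.items():
    print(name, dim)
```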
Brief description of the drawings
Fig. 1 is a flow diagram of the vision-based embedding vector generation method for ideographic characters of the present invention;
Fig. 2 is a schematic diagram of the mask picture and background picture generated in the present invention;
Fig. 3 is a schematic diagram of the coding vector of a character generated in the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The main solution of the embodiments of the present invention is as follows:
As shown in Fig. 1, a vision-based embedding vector generation method for ideographic characters comprises the following steps:
S1, dividing the text content according to a set recognition unit, and generating a corresponding mask picture for each recognition unit in turn;
S2, generating a black background picture whose size is not smaller than that of the mask picture in step S1;
S3, taking the mask picture of step S1 as the foreground and superimposing it on the background picture of step S2 to synthesize a single-channel grayscale picture based on the text content;
S4, extracting the grayscale matrix corresponding to a single character from the single-channel grayscale picture obtained in step S3, and using the grayscale matrix as the coding vector of that character.
Taking Chinese characters as an example, the present invention simulates the way humans naturally read Chinese and provides the computer with a scheme for encoding Chinese text based on visual signals, thereby improving the efficiency with which the computer processes Chinese text.
In an optional embodiment of the present invention, step S1 generates the corresponding mask pictures from Chinese text content. When generating the mask pictures, the single character is set as the recognition unit and the mask pictures are generated from the text content one character at a time; here a single character is specifically a Chinese character, a punctuation mark, a space, or another special character.
Preferably, the present invention can also set the sentence as the recognition unit and generate the mask pictures from the text content one sentence at a time.
Preferably, the present invention can also set the paragraph as the recognition unit and generate the mask pictures from the text content one paragraph at a time.
Preferably, the present invention can also set the entire article as the recognition unit and generate the mask pictures from the text content one article at a time.
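The four choices of recognition unit could be implemented with a small splitting helper like the one below. The patent does not specify how sentences or paragraphs are delimited, so the rules here are assumptions: a sentence ends at one of the marks 。！？, and paragraphs are separated by newlines. The function name `split_units` is hypothetical.

```python
import re


def split_units(text, unit="char"):
    """Split text into recognition units: char, sentence, paragraph, article."""
    if unit == "char":
        # Every character counts, including punctuation and spaces.
        return list(text)
    if unit == "sentence":
        # Assumed rule: a sentence runs up to and including 。！ or ？.
        parts = re.findall(r"[^。！？\n]*[。！？]|[^。！？\n]+", text)
        return [p for p in parts if p.strip()]
    if unit == "paragraph":
        # Assumed rule: paragraphs are separated by newlines.
        return [p for p in text.split("\n") if p.strip()]
    if unit == "article":
        return [text]
    raise ValueError("unknown recognition unit: " + unit)


sample = "今天下雨。我带伞了！\n明天晴。"
print(split_units(sample, "sentence"))
print(split_units(sample, "paragraph"))
```

Each returned unit would then be rendered to its own mask picture in step S1.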
As shown in Fig. 2, for a Chinese article, the Chinese characters, punctuation marks, and other characters are processed one by one from front to back, following human reading habits. Suppose the current iteration has reached the Chinese character meaning "certainly"; the white mask corresponding to this character is generated first. As shown in Fig. 2, the size of the generated mask picture is set to 25*27.
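A mask of the stated size can be sketched without a font library by scaling a small hand-drawn bitmap. Everything specific here is an assumption: the 5x5 bitmap of the character 十 replaces real glyph rendering, 25*27 is read as width 25 and height 27, and the glyph is centred in the mask.

```python
# Hypothetical 5x5 bitmap of the character "十", used instead of a real font.
TOY_BITMAP = ["..#..", "..#..", "#####", "..#..", "..#.."]


def make_mask(bitmap, width=25, height=27, scale=5):
    """Scale a small bitmap and centre it in a width x height white-on-black mask."""
    glyph_h = len(bitmap) * scale
    glyph_w = len(bitmap[0]) * scale
    if glyph_h > height or glyph_w > width:
        raise ValueError("scaled glyph does not fit in the mask")
    top = (height - glyph_h) // 2
    left = (width - glyph_w) // 2
    mask = [[0] * width for _ in range(height)]
    for r in range(glyph_h):
        for c in range(glyph_w):
            # Nearest-neighbour scaling: each source pixel becomes a block.
            if bitmap[r // scale][c // scale] == "#":
                mask[top + r][left + c] = 255
    return mask


mask = make_mask(TOY_BITMAP)
print(len(mask), len(mask[0]))  # 27 25
```

In practice the bitmap would come from rendering the character with a font; only the white-strokes-on-empty-canvas shape of the result matters for the later steps.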
In an optional embodiment of the present invention, step S2 generates a black background picture whose size is greater than or equal to that of the mask picture. As shown in Fig. 2, the mask of the Chinese character meaning "certainly" corresponds to a black background, and the size of the black background picture is set to 26*28.
In steps S1 and S2, the sizes of the mask picture and the background picture can be adjusted as needed, provided that the background picture is no smaller than the mask picture, so that subsequent steps can process the synthesized picture.
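The size constraint between the two pictures can be enforced when the background is created. In this sketch sizes are written as (height, width) tuples, and the default of one extra pixel in each dimension simply mirrors the 25*27 and 26*28 figures of the embodiment; both conventions, and the name `make_background`, are assumptions.

```python
def make_background(mask_size, background_size=None):
    """Create an all-black picture no smaller than the mask.

    Sizes are (height, width) tuples.
    """
    mask_h, mask_w = mask_size
    if background_size is None:
        # Assumed default: one extra pixel per dimension, as in the embodiment.
        background_size = (mask_h + 1, mask_w + 1)
    bg_h, bg_w = background_size
    if bg_h < mask_h or bg_w < mask_w:
        raise ValueError("background must not be smaller than the mask")
    return [[0] * bg_w for _ in range(bg_h)]


background = make_background((27, 25))
print(len(background), len(background[0]))  # 28 26
```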
In an optional embodiment of the present invention, step S3 takes the mask picture of step S1 as the foreground and superimposes it on the background picture of step S2 to synthesize a single-channel grayscale picture based on the text content. As shown in Fig. 2, the grayscale picture corresponding to the Chinese character meaning "certainly" is a 26*28 picture.
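Superimposing the mask on the background amounts to copying the non-zero mask pixels into a copy of the background. The patent does not say where the mask is placed, so the `top` and `left` offsets below are hypothetical parameters, as is the function name `superimpose`.

```python
def superimpose(mask, background, top=0, left=0):
    """Copy the white pixels of the mask onto a copy of the background."""
    mask_h, mask_w = len(mask), len(mask[0])
    bg_h, bg_w = len(background), len(background[0])
    if top < 0 or left < 0 or top + mask_h > bg_h or left + mask_w > bg_w:
        raise ValueError("mask must lie entirely inside the background")
    picture = [row[:] for row in background]  # leave the input untouched
    for r in range(mask_h):
        for c in range(mask_w):
            if mask[r][c]:
                picture[top + r][left + c] = mask[r][c]
    return picture


mask = [[0, 255], [255, 255]]
background = [[0] * 3 for _ in range(3)]
picture = superimpose(mask, background, top=1, left=1)
print(picture)  # [[0, 0, 0], [0, 0, 255], [0, 255, 255]]
```

Because both inputs hold a single intensity value per pixel, the result is already a single-channel grayscale picture.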
In an optional embodiment of the present invention, for the mask of a single character in step S1, step S4 extracts the grayscale matrix corresponding to that character from the single-channel grayscale picture obtained in step S3 and uses it as the coding vector of the character. For the mask of a sentence, a paragraph, or an article in step S1, the grayscale matrix (picture) obtained in step S3 is used as the encoding of that sentence, paragraph, or article.
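For a sentence-level unit, one plausible way to obtain a single picture is to lay the character masks side by side on one black canvas. The patent does not describe the layout, so the left-to-right arrangement, the one-pixel `gap`, and the name `encode_sentence` are all assumptions made for this sketch.

```python
def encode_sentence(char_masks, gap=1):
    """Place character masks left to right on one black picture."""
    height = max(len(m) for m in char_masks)
    width = sum(len(m[0]) for m in char_masks) + gap * (len(char_masks) - 1)
    picture = [[0] * width for _ in range(height)]
    left = 0
    for m in char_masks:
        for r, row in enumerate(m):
            for c, value in enumerate(row):
                picture[r][left + c] = value
        left += len(m[0]) + gap  # move past this glyph and the gap
    return picture


# Two hypothetical 2x2 character masks.
first = [[255, 0], [0, 255]]
second = [[255, 255], [0, 0]]
sentence_code = encode_sentence([first, second])
print(sentence_code)  # [[255, 0, 0, 255, 255], [0, 255, 0, 0, 0]]
```

The resulting matrix grows with the length of the unit, so a downstream model would need to accept variable-width input or pad to a fixed size.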
Preferably, the present invention can also binarize the single-channel grayscale picture based on the text content obtained in step S3, extract the binary matrix corresponding to a single character, and use the binary matrix as the coding vector of that character. As shown in Fig. 3, the encoding of the Chinese character meaning "certainly" is a 26*28 binary matrix.
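Binarization can be sketched as a fixed threshold applied to every pixel. The patent does not state a threshold, so the value 128 and the function name `binarize` are assumptions.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel to 1 (at or above threshold) or 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


gray = [[0, 40, 200], [255, 127, 128]]
binary = binarize(gray)
print(binary)  # [[0, 0, 1], [1, 0, 1]]
```

For a pure white-on-black mask any threshold between 1 and 255 gives the same result; the threshold only matters if the rendering produces intermediate gray levels, for example from anti-aliasing.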
Those of ordinary skill in the art will understand that the embodiments described here are intended to help the reader understand the principles of the present invention, and that the scope of protection of the present invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed herein, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the present invention.

Claims (7)

1. A vision-based embedding vector generation method for ideographic characters, characterized by comprising the following steps:
S1, dividing the text content according to a set recognition unit, and generating a corresponding mask picture for each recognition unit in turn;
S2, generating a black background picture whose size is not smaller than that of the mask picture in step S1;
S3, taking the mask picture of step S1 as the foreground and superimposing it on the background picture of step S2 to synthesize a single-channel grayscale picture based on the text content;
S4, extracting the grayscale matrix corresponding to a single character from the single-channel grayscale picture obtained in step S3, and using the grayscale matrix as the coding vector of that character.
2. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the single character as the recognition unit, and generating the corresponding mask pictures from the text content one character at a time.
3. The vision-based embedding vector generation method for ideographic characters according to claim 2, characterized in that the single character is specifically a Chinese character, a punctuation mark, a space, or another special character.
4. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the sentence as the recognition unit, and generating the corresponding mask pictures from the text content one sentence at a time.
5. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the paragraph as the recognition unit, and generating the corresponding mask pictures from the text content one paragraph at a time.
6. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
setting the article as the recognition unit, and generating the corresponding mask pictures from the text content one article at a time.
7. The vision-based embedding vector generation method for ideographic characters according to any one of claims 1 to 6, characterized in that step S4 further comprises:
binarizing the single-channel grayscale picture based on the text content obtained in step S3, extracting the binary matrix corresponding to a single character, and using the binary matrix as the coding vector of that character.
CN201910717710.9A 2019-08-05 2019-08-05 Vision-based embedding vector generation method for ideographic characters Pending CN110442680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910717710.9A CN110442680A (en) 2019-08-05 2019-08-05 Vision-based embedding vector generation method for ideographic characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910717710.9A CN110442680A (en) 2019-08-05 2019-08-05 Vision-based embedding vector generation method for ideographic characters

Publications (1)

Publication Number Publication Date
CN110442680A true CN110442680A (en) 2019-11-12

Family

ID=68433204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910717710.9A Pending CN110442680A (en) 2019-08-05 2019-08-05 Vision-based embedding vector generation method for ideographic characters

Country Status (1)

Country Link
CN (1) CN110442680A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663380A (en) * 2012-03-30 2012-09-12 中南大学 Method for identifying character in steel slab coding image
CN105009146A (en) * 2012-06-26 2015-10-28 艾克尼特有限公司 Image mask providing a machine-readable data matrix code
CN107818321A (en) * 2017-10-13 2018-03-20 上海眼控科技股份有限公司 A kind of watermark date recognition method for vehicle annual test
CN108520258A (en) * 2018-04-04 2018-09-11 湖南科技大学 Character code mark
CN108805126A (en) * 2017-04-28 2018-11-13 上海斯睿德信息技术有限公司 A kind of long interfering line minimizing technology of text image
CN109460701A (en) * 2018-09-10 2019-03-12 昆明理工大学 A kind of character recognition method based on histogram in length and breadth


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191112