CN110442680A - Vision-based embedding vector generation method for ideographic characters - Google Patents
- Publication number
- CN110442680A (application CN201910717710.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- picture
- recognition unit
- content
- ideograph
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a vision-based embedding vector generation method for ideographic characters. The method generates a mask picture for the text content according to a recognition unit, generates a black background picture corresponding to the mask picture, superimposes the mask picture on the background picture to synthesize a single-channel grayscale picture of the text content, and extracts the grayscale matrix corresponding to each single character as the coding vector of that character. The proposed method can simplify the natural language processing pipeline and significantly improve the efficiency with which a computer processes text content.
Description
Technical field
The invention belongs to the field of natural language processing technology, and in particular relates to a vision-based embedding vector generation method for ideographic characters.
Background art
In the field of natural language processing (Natural Language Processing), for a computer to understand the meaning of a word or character, the word or character usually has to be encoded, that is, uniformly converted into a corresponding real-valued vector, with each word or character mapped one-to-one to a real vector. Once words or characters are encoded, natural language processing algorithms can operate on the vectors directly in order to understand the words or characters. Word vectors are now widely used in text classification, text sentiment analysis, machine reading, machine translation, and other fields.
The two most common word vector encoding schemes at present are one-hot encoding, which is based on a dictionary or character set, and methods based on distributed word vector models, such as word2vec (skip-gram, CBOW) and GloVe.
One-hot encoding constructs, for each word, a corresponding unique high-dimensional binary vector whose length equals the dictionary length; exactly one component of this vector equals 1 and all other components are 0. Suppose the number of words counted in the corpus is 10,000 (low-frequency words are usually discarded); the dictionary length is then 10,000. The word vector of the first word in the dictionary is (1, 0, 0, ..., 0), that of the second word is (0, 1, 0, ..., 0), and that of the i-th word is (0, ..., 0, 1, 0, ..., 0), i.e., the i-th component is 1 and all other components are 0.
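The one-hot construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the patent: the four-word vocabulary and the helper names `one_hot` and `encode` are assumptions chosen for the example, and indices are 0-based whereas the text counts words from 1.

```python
def one_hot(index, vocab_size):
    """Return a binary vector of length vocab_size with a single 1 at index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index outside the dictionary")
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

# Hypothetical toy dictionary; the text's example would hold 10,000 words.
vocab = ["我", "爱", "自然", "语言"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode(word):
    """Look the word up in the dictionary and one-hot encode its position."""
    return one_hot(word_to_index[word], len(vocab))

print(encode("我"))    # [1, 0, 0, 0]
print(encode("语言"))  # [0, 0, 0, 1]
```

A word missing from `word_to_index` raises `KeyError`, which is the dictionary-maintenance burden the description goes on to criticise.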
A distributed word vector represents each word in the dictionary with a continuous, low-dimensional real-valued vector. An important approach to distributed word vectors is word2vec: first randomly initialize a vector (for example, 100-dimensional) for each word, then continually update the word vectors according to the relative positions in which words occur in the sentences of the corpus, finally obtaining a unique distributed representation for each word. The benefit of this method is that the distances between word vectors approximately reflect the semantic relationships between words.
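How closely two word vectors are related is usually checked with a similarity measure. The sketch below uses cosine similarity, a common choice that the source does not name; the three-dimensional vectors and their labels are made-up toy values, not trained embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only).
vec_cat = [0.9, 0.1, 0.0]
vec_dog = [0.8, 0.2, 0.1]
vec_car = [0.0, 0.1, 0.9]

sim_cat_dog = cosine_similarity(vec_cat, vec_dog)
sim_cat_car = cosine_similarity(vec_cat, vec_car)
print(sim_cat_dog > sim_cat_car)  # True
```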
These methods are convenient when applied to text analysis tasks in phonographic languages such as English. In text analysis tasks based on ideographic scripts such as Chinese, however, they have significant limitations. Taking Chinese as an example, the specific shortcomings of existing word vector schemes are described below.
First, if a traditional word vector generation scheme is used to encode Chinese, both one-hot encoding and distributed encoding require the corpus to be cleaned. Corpus cleaning includes discarding low-frequency words and removing non-Chinese characters (including punctuation marks and the text and characters of other languages).
Second, the minimum morpheme unit of Chinese is the character, so both one-hot word vectors and distributed word vectors require Chinese word segmentation of the corpus. Chinese word segmentation recombines a continuous sequence of characters into a sequence of words according to certain rules. In other words, traditional word vectors need a separate segmentation algorithm to preprocess the corpus.
Third, after corpus cleaning and word segmentation, both one-hot encoding and distributed encoding also need to maintain a dictionary. For one-hot encoding, the word vector length equals the dictionary length, so choosing a reasonable dictionary length is a troublesome problem: if the dimension is too high, more of the words in the corpus are retained but the word vectors become longer; conversely, reducing the dimension necessarily reduces the number of words kept in the dictionary. Distributed encoding yields lower-dimensional word vectors than one-hot encoding, but the quality of the resulting word vectors depends heavily on the quality of the corpus.
Summary of the invention
The main purpose of the present invention is to provide a vision-based embedding vector generation method for ideographic characters, intended to solve the above technical problems of existing methods.
To achieve the above object, the present invention provides a vision-based embedding vector generation method for ideographic characters, comprising the following steps:
S1. Divide the text content according to a set recognition unit, and generate a corresponding mask picture for each recognition unit in turn;
S2. Generate a black background picture whose size is not smaller than that of the mask picture in step S1;
S3. Using the mask picture from step S1 as the foreground, superimpose the mask picture on the background picture from step S2 to synthesize a single-channel grayscale picture of the text content;
S4. From the single-channel grayscale picture of the text content obtained in step S3, extract the grayscale matrix corresponding to each single character, and use the grayscale matrix as the coding vector of that character.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a single character in the text as the recognition unit, and generating the corresponding mask pictures for the text content one character at a time.
Further, the single character is specifically a Chinese character, punctuation mark, space, or other special character.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a sentence in the text as the recognition unit, and generating the corresponding mask pictures for the text content one sentence at a time.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a paragraph in the text as the recognition unit, and generating the corresponding mask pictures for the text content one paragraph at a time.
Further, dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting an article in the text as the recognition unit, and generating the corresponding mask pictures for the text content one article at a time.
Further, step S4 further comprises:
Binarizing the single-channel grayscale picture of the text content obtained in step S3, extracting the binary matrix corresponding to each single character, and using the binary matrix as the coding vector of that character.
The invention has the following beneficial effects:
(1) When the coding vectors generated by the present invention are fed into a deep learning model, performance on Chinese natural language processing tasks such as text classification, text sentiment analysis, and text understanding improves significantly compared with models based on traditional word vectors;
(2) The present invention requires no preprocessing such as corpus cleaning or Chinese word segmentation, which eliminates all the preprocessing problems of traditional word vector generation schemes;
(3) The dimension of the coding vectors generated by the present invention (usually several hundred; for example, each character in Fig. 1 has 728 dimensions) lies between that of one-hot encoding (usually thousands or even tens of thousands, tied to the length of the maintained dictionary) and that of word2vec (usually around 100), so the vectors are sparse while remaining relatively low-dimensional;
(4) The present invention does not need to maintain an additional dictionary, since any Chinese character or other character can be processed as a picture.
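The dimension comparison in item (3) is simple arithmetic. The sketch below assumes the 26*28 picture size given in the embodiment and the round figures of 10,000 (one-hot) and 100 (word2vec) quoted in this description; the variable names and the row/column order are illustrative assumptions.

```python
# Sizes quoted in the description; which number is rows and which is
# columns is an assumption and does not affect the product.
rows, cols = 26, 28
visual_dim = rows * cols   # one component per pixel of the character picture
one_hot_dim = 10_000       # dictionary length used in the background example
word2vec_dim = 100         # typical dimension quoted for word2vec

print(visual_dim)                               # 728
print(word2vec_dim < visual_dim < one_hot_dim)  # True
```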
Brief description of the drawings
Fig. 1 is a flow diagram of the vision-based embedding vector generation method for ideographic characters of the present invention;
Fig. 2 is a schematic diagram of the mask picture and background picture generated in the present invention;
Fig. 3 is a schematic diagram of the coding vector of a character generated in the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The main solution of the embodiments of the present invention is as follows:
As shown in Fig. 1, a vision-based embedding vector generation method for ideographic characters comprises the following steps:
S1. Divide the text content according to a set recognition unit, and generate a corresponding mask picture for each recognition unit in turn;
S2. Generate a black background picture whose size is not smaller than that of the mask picture in step S1;
S3. Using the mask picture from step S1 as the foreground, superimpose the mask picture on the background picture from step S2 to synthesize a single-channel grayscale picture of the text content;
S4. From the single-channel grayscale picture of the text content obtained in step S3, extract the grayscale matrix corresponding to each single character, and use the grayscale matrix as the coding vector of that character.
Taking Chinese characters as an example, and starting from the idea of simulating how humans naturally read Chinese, the present invention provides computers with a scheme for encoding Chinese text based on visual signals, thereby improving the efficiency with which computers process Chinese text.
In an optional embodiment of the present invention, step S1 generates the corresponding mask pictures from Chinese text content. When generating mask pictures, a single character in the text is set as the recognition unit, and mask pictures are generated for the text content one character at a time; here a single character is specifically a Chinese character, punctuation mark, space, other special character, and so on.
Preferably, the present invention can also set a sentence in the text as the recognition unit and generate the corresponding mask pictures for the text content one sentence at a time.
Preferably, the present invention can also set a paragraph in the text as the recognition unit and generate the corresponding mask pictures for the text content one paragraph at a time.
Preferably, the present invention can also set an entire article as the recognition unit and generate the corresponding mask pictures for the text content one article at a time.
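The four recognition units (single character, sentence, paragraph, article) differ only in how the text is cut before it is rendered. A minimal sketch follows; the function name `split_units`, the set of sentence-ending punctuation marks, and the newline paragraph delimiter are all assumptions, since the source does not specify segmentation rules.

```python
import re

def split_units(text, unit):
    """Cut text into recognition units; unit is one of
    'char', 'sentence', 'paragraph', 'article'."""
    if unit == "char":
        # Every character counts: Chinese characters, punctuation, spaces.
        return list(text)
    if unit == "sentence":
        # Assumed rule: a sentence ends after one of these marks.
        parts = re.split(r"(?<=[。！？!?])", text)
        return [p for p in parts if p.strip()]
    if unit == "paragraph":
        # Assumed rule: paragraphs are separated by newlines.
        return [p for p in text.split("\n") if p.strip()]
    if unit == "article":
        return [text]
    raise ValueError("unknown recognition unit: " + unit)

print(split_units("你好。再见！", "sentence"))  # ['你好。', '再见！']
```

Note that the character-level unit needs no word segmentation at all, which is the property the description emphasises.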
As shown in Fig. 2, for a Chinese article, each Chinese character and other characters such as punctuation marks are processed one by one from front to back, following human reading habits. Suppose the current iteration reaches the Chinese character "certainly"; the white mask corresponding to "certainly" is generated first. As shown in Fig. 2, the size of the generated mask picture is set to 25*27.
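A white-on-black mask of a fixed size can be sketched as a plain 2D list. The source does not say how the glyph is rasterized, so the 5x5 bitmap below is a hypothetical stand-in for real font rendering, and reading 25*27 as rows by columns is also an assumption.

```python
# Hypothetical 5x5 bitmap standing in for a rendered glyph; a real system
# would rasterize the character with a font engine instead.
TOY_GLYPH = [
    "..#..",
    ".###.",
    "#.#.#",
    "..#..",
    "..#..",
]

def make_mask(glyph, rows, cols):
    """Return a rows x cols mask: 255 (white) on strokes, 0 elsewhere,
    with the glyph centred in the picture."""
    glyph_rows, glyph_cols = len(glyph), len(glyph[0])
    if glyph_rows > rows or glyph_cols > cols:
        raise ValueError("glyph does not fit in the mask")
    top = (rows - glyph_rows) // 2
    left = (cols - glyph_cols) // 2
    mask = [[0] * cols for _ in range(rows)]
    for r, line in enumerate(glyph):
        for c, cell in enumerate(line):
            if cell == "#":
                mask[top + r][left + c] = 255
    return mask

mask = make_mask(TOY_GLYPH, 25, 27)
print(len(mask), len(mask[0]))  # 25 27
```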
In an optional embodiment of the present invention, step S2 generates a black background picture whose size is greater than or equal to that of the mask picture. As shown in Fig. 2, the mask of the Chinese character "certainly" corresponds to a black background, and the size of the black background picture is set to 26*28.
In steps S1 and S2, the sizes of the generated mask picture and background picture can be adjusted as needed, provided the background picture is no smaller than the mask picture, so that subsequent steps can process the synthesized picture.
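The size constraint between mask and background can be enforced at construction time. In this sketch the helper name `make_background` and the padding parameters are assumptions; the default of one extra row and column reproduces the 25*27 mask and 26*28 background of the example.

```python
def make_background(mask_rows, mask_cols, pad_rows=1, pad_cols=1):
    """Return an all-black (0) picture at least as large as the mask."""
    if pad_rows < 0 or pad_cols < 0:
        raise ValueError("background must not be smaller than the mask")
    return [[0] * (mask_cols + pad_cols) for _ in range(mask_rows + pad_rows)]

background = make_background(25, 27)
print(len(background), len(background[0]))  # 26 28
```

Passing `pad_rows=0, pad_cols=0` gives a background exactly the size of the mask, which the "greater than or equal to" wording also permits.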
In an optional embodiment of the present invention, step S3 takes the mask picture from step S1 as the foreground and superimposes it on the background picture from step S2, synthesizing a single-channel grayscale picture of the text content. As shown in Fig. 2, the grayscale picture corresponding to the Chinese character "certainly" is a 26*28 picture.
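Superimposing the mask on the background amounts to copying the foreground pixels into a copy of the background. The sketch below is one plausible reading: the function name `overlay`, the top-left placement offset, and the use of `max` to combine pixels are assumptions, since the source does not state where on the background the mask lands.

```python
def overlay(mask, background, top=0, left=0):
    """Paste the mask onto a copy of the background and return the
    resulting single-channel grayscale picture."""
    mask_rows, mask_cols = len(mask), len(mask[0])
    bg_rows, bg_cols = len(background), len(background[0])
    if (top < 0 or left < 0
            or top + mask_rows > bg_rows or left + mask_cols > bg_cols):
        raise ValueError("mask does not fit on the background at this offset")
    picture = [row[:] for row in background]  # leave the background intact
    for r in range(mask_rows):
        for c in range(mask_cols):
            # The foreground wins wherever it is brighter than the background.
            picture[top + r][left + c] = max(picture[top + r][left + c],
                                             mask[r][c])
    return picture

# Tiny stand-in data: a 2x2 mask on a 3x3 black background.
mask = [[255, 0],
        [0, 255]]
background = [[0] * 3 for _ in range(3)]
gray = overlay(mask, background)
print(gray)  # [[255, 0, 0], [0, 255, 0], [0, 0, 0]]
```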
In an optional embodiment of the present invention, for the mask of a single character from step S1, step S4 extracts the grayscale matrix corresponding to that character from the single-channel grayscale picture of the text content obtained in step S3 and uses it as the coding vector of the character. Correspondingly, when step S1 produces the mask of a sentence, a paragraph, or an article, the grayscale matrix (picture) obtained in step S3 serves as the encoding of that sentence, paragraph, or article.
Preferably, the present invention can also binarize the single-channel grayscale picture of the text content obtained in step S3, extract the binary matrix corresponding to each single character, and use the binary matrix as the coding vector of that character. As shown in Fig. 3, the encoding of the Chinese character "certainly" is a 26*28 binary matrix.
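Binarization maps each grayscale pixel to 0 or 1 by comparing it with a threshold. The source gives no threshold, so the value 128 and the function name `binarize` in this sketch are assumptions.

```python
def binarize(gray_matrix, threshold=128):
    """Map each pixel to 1 if it is at least the threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row]
            for row in gray_matrix]

gray = [[0, 40, 200],
        [255, 127, 128]]
binary = binarize(gray)
print(binary)  # [[0, 0, 1], [1, 0, 1]]
```

For a pure white-on-black mask the choice of threshold hardly matters; it only becomes relevant if the rendering step produces anti-aliased intermediate grey values.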
Those of ordinary skill in the art will understand that the embodiments described here are intended to help the reader understand the principles of the present invention, and that the scope of protection of the present invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed herein, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the present invention, and these variations and combinations remain within the scope of protection of the present invention.
Claims (7)
1. A vision-based embedding vector generation method for ideographic characters, characterized by comprising the following steps:
S1. Divide the text content according to a set recognition unit, and generate a corresponding mask picture for each recognition unit in turn;
S2. Generate a black background picture whose size is not smaller than that of the mask picture in step S1;
S3. Using the mask picture from step S1 as the foreground, superimpose the mask picture on the background picture from step S2 to synthesize a single-channel grayscale picture of the text content;
S4. From the single-channel grayscale picture of the text content obtained in step S3, extract the grayscale matrix corresponding to each single character, and use the grayscale matrix as the coding vector of that character.
2. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a single character in the text as the recognition unit, and generating the corresponding mask pictures for the text content one character at a time.
3. The vision-based embedding vector generation method for ideographic characters according to claim 2, characterized in that the single character is specifically a Chinese character, punctuation mark, space, or other special character.
4. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a sentence in the text as the recognition unit, and generating the corresponding mask pictures for the text content one sentence at a time.
5. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting a paragraph in the text as the recognition unit, and generating the corresponding mask pictures for the text content one paragraph at a time.
6. The vision-based embedding vector generation method for ideographic characters according to claim 1, characterized in that dividing the text content according to the set recognition unit and generating a corresponding mask picture for each recognition unit in turn specifically comprises:
Setting an article in the text as the recognition unit, and generating the corresponding mask pictures for the text content one article at a time.
7. The vision-based embedding vector generation method for ideographic characters according to any one of claims 1 to 6, characterized in that step S4 further comprises:
Binarizing the single-channel grayscale picture of the text content obtained in step S3, extracting the binary matrix corresponding to each single character, and using the binary matrix as the coding vector of that character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910717710.9A CN110442680A (en) | 2019-08-05 | 2019-08-05 | Vision-based embedding vector generation method for ideographic characters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910717710.9A CN110442680A (en) | 2019-08-05 | 2019-08-05 | Vision-based embedding vector generation method for ideographic characters |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110442680A (en) | 2019-11-12 |
Family
ID=68433204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910717710.9A (Pending) CN110442680A (en) | Vision-based embedding vector generation method for ideographic characters | 2019-08-05 | 2019-08-05 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442680A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663380A (en) * | 2012-03-30 | 2012-09-12 | 中南大学 | Method for identifying character in steel slab coding image |
CN105009146A (en) * | 2012-06-26 | 2015-10-28 | 艾克尼特有限公司 | Image mask providing a machine-readable data matrix code |
CN107818321A (en) * | 2017-10-13 | 2018-03-20 | 上海眼控科技股份有限公司 | A kind of watermark date recognition method for vehicle annual test |
CN108520258A (en) * | 2018-04-04 | 2018-09-11 | 湖南科技大学 | Character code mark |
CN108805126A (en) * | 2017-04-28 | 2018-11-13 | 上海斯睿德信息技术有限公司 | A kind of long interfering line minimizing technology of text image |
CN109460701A (en) * | 2018-09-10 | 2019-03-12 | 昆明理工大学 | A kind of character recognition method based on histogram in length and breadth |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102490752B1 (en) | Deep context-based grammatical error correction using artificial neural networks | |
CN109871535B (en) | French named entity recognition method based on deep neural network | |
CN108984530B (en) | Detection method and detection system for network sensitive content | |
CN111178094B (en) | Pre-training-based scarce resource neural machine translation training method | |
CN109670041A (en) | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods | |
CN111897949A (en) | Guided text abstract generation method based on Transformer | |
CN111241232B (en) | Business service processing method and device, service platform and storage medium | |
KR20210151281A (en) | Textrank based core sentence extraction method and device using bert sentence embedding vector | |
CN112380319A (en) | Model training method and related device | |
Tang et al. | FontRNN: Generating Large‐scale Chinese Fonts via Recurrent Neural Network | |
Huang et al. | Addressing domain adaptation for chinese word segmentation with global recurrent structure | |
CN110851594A (en) | Text classification method and device based on multi-channel deep learning model | |
CN111859940B (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN109993216B (en) | Text classification method and device based on K nearest neighbor KNN | |
CN110866087B (en) | Entity-oriented text emotion analysis method based on topic model | |
CN113535915A (en) | Method for expanding a data set | |
WO2021239631A1 (en) | Neural machine translation method, neural machine translation system, learning method, learning system, and programm | |
CN111178009B (en) | Text multilingual recognition method based on feature word weighting | |
Dilawari et al. | Neural attention model for abstractive text summarization using linguistic feature space | |
CN111523325A (en) | Chinese named entity recognition method based on strokes | |
CN110442680A (en) | Vision-based embedding vector generation method for ideographic characters | |
CN114676699A (en) | Entity emotion analysis method and device, computer equipment and storage medium | |
de Lima Santos et al. | Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese. | |
Rajalingam et al. | Image captioning in tamil language with merge architecture | |
Kim et al. | Label Embedding for Improving Classification Accuracy UsingAutoEncoderwithSkip-Connections |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20191112 |