CN110532381A - Text vector acquisition method, device, computer equipment and storage medium - Google Patents
Text vector acquisition method, device, computer equipment and storage medium
- Publication number
- CN110532381A (application CN201910637101.2A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The present invention applies to the field of artificial intelligence and provides a text vector acquisition method, device, computer equipment and storage medium. The method includes: performing text processing on a text to obtain a target text, and performing word segmentation on the target text to obtain a corresponding feature text; encoding the feature text into a multi-dimensional one-hot vector space with a preset first encoder to obtain a first feature vector; encoding the first feature vector into a word vector space with a preset second encoder to obtain a second feature vector; inputting the second feature vector and the classification label into a third encoder and iterating its loss function until the hidden-layer vectors satisfy the constraint that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network; and, after processing a text to be processed, inputting it into the target encoding network to obtain the text vector of the text to be processed. The application can enhance the characterization ability of text vectors.
Description
Technical field
The invention belongs to the field of artificial intelligence, and more particularly relates to a text vector acquisition method, device, computer equipment and storage medium.
Background technique
At present, natural language processing is an important direction in computer science and artificial intelligence, and with the rapid development of natural language processing technology, its basic research, including research on how to generate text vectors, is receiving growing attention. Tasks such as text classification, text clustering and similarity calculation all require the text to be vectorized in advance; mathematical operations and statistics are then performed on the vectors in place of the original text, turning natural language into data a computer can identify and enabling people to communicate with computers in natural language. In existing natural language processing, the Sentence2Vec (sentence vector model) method gathers the text content of all types of texts, processes it, and trains on it as a corpus; text features are obtained by outputting word vectors from Word2vec (a word vector model) and averaging them, and the averaged result is used directly as the text vector. It can be seen that existing text vector acquisition techniques suffer from low text characterization ability.
Summary of the invention
The embodiments of the present invention provide a text vector acquisition method, intended to solve the problem of low text characterization ability in existing text vector acquisition techniques.
An embodiment of the present invention is implemented as a text vector acquisition method, comprising the steps of:
performing word segmentation on at least two different types of target texts that have undergone text processing, to obtain corresponding feature texts, wherein a text includes a classification label and text content;
encoding the feature text into a multi-dimensional one-hot vector space with a preset first encoder, to obtain a first feature vector of the feature text;
encoding the first feature vector into a word vector space with a preset second encoder, to obtain a second feature vector of the first feature vector;
inputting the second feature vector and the classification label into a third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the constraint that same-type text similarity is greater than different-type text similarity, to obtain a target encoding network;
obtaining a text to be processed, and inputting it into the target encoding network after the text processing and word segmentation, to obtain the text vector of the text to be processed.
Further, the step of performing text processing on at least two different types of texts to obtain target texts includes:
removing punctuation marks from the text to obtain a first text;
converting upper-case letters in the first text to lower case to obtain a second text;
converting full-width characters in the second text to half-width to obtain the target text.
Further, the step of performing word segmentation on the target text to obtain the corresponding feature text includes:
performing word segmentation on the target text with a segmenter to obtain a segmentation result; and
forming the segmentation result into the feature text.
Further, after the segmentation result is obtained, the method comprises the steps of:
detecting, with a preset stop-word lexicon, whether stop words exist in the segmentation result;
and if so, deleting the stop words.
Further, the step of encoding the first feature vector into the word vector space with the preset second encoder to obtain the second feature vector of the first feature vector includes:
reducing the dimensionality of the first feature vector with a weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vector at the hidden layer.
Further, inputting the second feature vector and the classification label into the third encoder and training the third encoder comprises the steps of:
inputting the second feature vector and the classification label into a denoising autoencoder, and randomly damaging the second feature vector to obtain a third feature vector;
training the denoising autoencoder based on the third feature vector.
Further, the step of iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the constraint that same-type text similarity is greater than different-type text similarity, to obtain the target encoding network, includes:
calculating, by classification label, the inner product between the text vectors of each text;
comparing the inner-product results of each text to obtain the similarity of each text;
forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
The present invention also provides a text vector acquisition device, comprising:
a processing module, configured to perform text processing on at least two different types of texts to obtain target texts, and to perform word segmentation on the target texts to obtain corresponding feature texts, wherein a text includes a classification label and text content;
a first encoding module, configured to encode the feature text into the multi-dimensional one-hot vector space with the preset first encoder, to obtain the first feature vector of the feature text;
a second encoding module, configured to encode the first feature vector into the word vector space with the preset second encoder, to obtain the second feature vector of the first feature vector;
a training module, configured to input the second feature vector and the classification label into the third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the constraint that same-type text similarity is greater than different-type text similarity, to obtain the target encoding network;
an input module, configured to obtain a text to be processed and input it into the target encoding network after the text processing and word segmentation, to obtain the text vector of the text to be processed.
The present invention also provides a computer device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the steps of the text vector acquisition method according to any one of claims 1 to 7.
The present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the text vector acquisition method according to any one of claims 1 to 7.
The beneficial effects achieved by the present invention are as follows: by performing word segmentation on the target text and, based on the first encoder and the second encoder, encoding the feature text to obtain the first feature vector and the second feature vector (word vector) corresponding to the first feature vector, then inputting the second feature vector together with the classification label into the third encoder for training, allowing the second feature vector to be damaged or polluted, and training the model until same-type text similarity is greater than different-type text similarity, the resulting text vector is more stable and the characterization ability of the text vector composed of word vectors is enhanced.
Detailed description of the invention
Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
Fig. 2 is a flowchart of one embodiment of the text vector acquisition method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a specific embodiment of S201 in Fig. 2;
Fig. 4 is a flowchart of another specific embodiment of S201 in Fig. 2;
Fig. 5 is a flowchart of a specific embodiment of S401 in Fig. 4;
Fig. 6 is a flowchart of a specific embodiment of S203 in Fig. 2;
Fig. 7 is a flowchart of a specific embodiment of S204 in Fig. 2;
Fig. 8 is a flowchart of another specific embodiment of S204 in Fig. 2;
Fig. 9 is a structural schematic diagram of a text vector acquisition device provided by an embodiment of the present invention;
Figure 10 is a structural schematic diagram of a specific embodiment of the processing module shown in Fig. 9;
Figure 11 is a structural schematic diagram of another specific embodiment of the processing module shown in Fig. 9;
Figure 12 is a structural schematic diagram of a further specific embodiment of the processing module shown in Fig. 9;
Figure 13 is a structural schematic diagram of a specific embodiment of the training module shown in Fig. 9;
Figure 14 is a structural schematic diagram of another specific embodiment of the training module shown in Fig. 9;
Figure 15 is a structural schematic diagram of one embodiment of the computer equipment of the application.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Because the present invention performs word segmentation on the target text and, based on the first encoder and the second encoder, encodes the feature text to obtain the first feature vector and the second feature vector (word vector) corresponding to the first feature vector, and inputs the second feature vector with the classification label into the third encoder for training, allowing the second feature vector to be damaged or polluted, and trains the model until same-type text similarity is greater than different-type text similarity, the resulting text vector is more stable and the characterization ability of the text vector composed of word vectors is enhanced.
As shown in Figure 1, the system architecture 100 may include a server 105, a network 104 and terminal devices 101, 102, 103. The network 104 serves as the medium providing communication links between the server 105 and the terminal devices 101, 102, 103, and may include various connection types, such as wired or wireless communication links or fiber-optic cables. The terminal devices 101, 102, 103 may be various electronic equipment with a display screen capable of downloading application software and displaying text, including but not limited to smartphones, tablet computers, laptop computers, desktop computers and the like. The server 105 may be a server providing various services, for example a background server supporting the pages displayed on the terminal devices 101, 102, 103. A client may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or obtain information.
It should be noted that the text vector acquisition method provided by the embodiments of the present application may be executed by the server or the terminal device, and correspondingly, the text vector acquisition device may be arranged in the server or the terminal device.
It should be understood that the numbers of mobile terminals, networks and devices in Fig. 1 are merely schematic; any number of mobile terminals, networks and servers may be provided according to implementation needs.
As shown in Fig. 2, which is a flowchart of one embodiment of the text vector acquisition method of the application, the method comprises the steps of:
S201: performing word segmentation on at least two different types of target texts that have undergone text processing, to obtain corresponding feature texts, wherein a text includes a classification label and text content.
In the present embodiment, the text vector acquisition method runs on an electronic device (such as the mobile terminal shown in Fig. 1). The text may be a text under an information category, for example a text under a column such as sports news or entertainment news in a news classification; it may also be a text under a content category, such as a text in different sections of a forum; of course, it may also be a text under an insurance category, or any other text formed in natural language — the embodiments of the present invention impose no limitation on this. The classification label indicates the type (also called the category) of the text, and the text content refers to the textual information formed in natural language.
Specifically, the above text processing can be understood as processing the text content of the text, converting it into natural language text convenient for computer processing; in order to increase the processing speed of the computer, the format and characters of the content are processed correspondingly, for example converting the text format to TXT and removing irrelevant words. Performing word segmentation on the target text may use a segmentation tool; obtaining the corresponding feature text can be understood as segmenting a text to obtain multiple phrases, the feature text including those phrases.
S202: encoding the feature text into the multi-dimensional one-hot vector space with the preset first encoder, to obtain the first feature vector of the feature text.
In the present embodiment, the encoding rule of the first encoder may be user-defined or a publicly available encoding rule. The first encoder may be a one-hot encoder, which encodes the segmented words of the feature text into one-hot vectors, thereby obtaining the first feature vector. Encoding the feature text into a multi-dimensional one-hot vector space allows the feature text to be encoded quickly.
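The one-hot step in S202 can be sketched as follows: each distinct token in the feature text gets an index, and a token is encoded as a vector that is all zeros except at that index. The vocabulary and tokens here are illustrative, not drawn from the patent.

```python
# Minimal sketch of the first encoder: mapping each token of the
# feature text to a one-hot vector over the vocabulary.

def build_vocab(tokens):
    """Assign each distinct token an index, in sorted order for determinism."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def one_hot(token, vocab):
    """Encode a single token as a one-hot vector of length |vocab|."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens = ["body", "sleep", "night", "body"]
vocab = build_vocab(tokens)                      # {'body': 0, 'night': 1, 'sleep': 2}
first_feature = [one_hot(t, vocab) for t in tokens]
```

The vocabulary size fixes the dimensionality of the one-hot space, which is why the later weight-matrix projection (S601) is needed to reduce it.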
S203: encoding the first feature vector into the word vector space with the preset second encoder, to obtain the second feature vector of the first feature vector.
In the embodiments of the present invention, by way of deep learning, the text is converted into vectors in a multi-dimensional vector space for calculation. The second encoder may be word2vec: the first feature vector of the obtained feature text is encoded as the input vector of word2vec, which maps the first feature vector to the word vector space, seeking a deeper feature representation of the text data. The continuous bag-of-words model in word2vec may be used to make predictions from the input first feature vector, thereby obtaining the second feature vector (word vector).
S204: inputting the second feature vector and the classification label into the third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the constraint that same-type text similarity is greater than different-type text similarity, to obtain the target encoding network.
In the embodiments of the present invention, the hidden layer may be left without an activation function, since only the feature vector of the hidden layer is needed. The loss function of the third encoder may combine a reconstruction loss with a similarity loss, wherein the function L_R(·) is the basic squared-error loss, L_R(y_n, x_n) = ‖y_n − x_n‖², L_T(h_0, h_1, h_2) = Sim(h_0, h_1) − Sim(h_0, h_2), α is a real number between 0 and 1, and Sim(·) is the inner-product function. Through the classification label, the calculation drives the hidden-layer vectors in the third encoder to satisfy same-type text similarity greater than different-type text similarity; for example, X_0 and X_1 are of the same text type with a similarity of 80%, while X_2 and X_0 are of different text types with a similarity of 1%. Of course, similarity can also refer to distance, for example Beijing and Tianjin being of the same text type while Beijing and Xinjiang are of different text types. In this way, the characterization ability of the text vector can be enhanced.
S205: obtaining a text to be processed, and inputting it into the target encoding network after text processing and word segmentation, to obtain the text vector of the text to be processed.
In the embodiments of the present invention, the text to be processed may be a text whose features need to be extracted from its textual information; it may be a newly added text, such as one newly uploaded or newly crawled by a user. Since the hidden layer of the third encoder has no activation function, the text vector of the text to be processed can be obtained from the hidden layer of the third encoder within the target encoding network; that is, the text to be processed, after text processing and word segmentation have been completed, is input into the fully trained, classification-capable third encoder to obtain a text vector with category attributes.
Because the present invention performs word segmentation on the target text, encodes the feature text through the neural networks of the first encoder and the second encoder to obtain the first feature vector and the corresponding second feature vector (word vector), inputs the second feature vector with the classification label into the third encoder for training while allowing the second feature vector to be damaged or polluted, trains the model until same-type text similarity is greater than different-type text similarity, and then inputs the obtained text to be processed, after text processing and word segmentation, into the trained third encoder for encoding, the resulting text vector is more stable and the characterization ability of the text vector composed of word vectors is enhanced.
Further, as shown in Fig. 3, step S201 includes:
S301: removing punctuation marks from the text to obtain a first text;
S302: converting upper-case letters in the first text to lower case to obtain a second text;
S303: converting full-width characters in the second text to half-width to obtain the target text.
Strings can be processed with regular expressions ("regex matching"): specific characters describe the rules by which characters occur in a string, so that strings conforming to some rule can be matched, extracted or replaced, and strings can be searched, deleted and substituted quickly and precisely.
Specifically, the text is matched against a character expression, and when punctuation marks are matched in the text they are deleted, yielding the first text. The character expression refers to a regular expression for matching the punctuation in the text, for example one built on the Unicode punctuation property \p{P} together with symbol characters such as ~ $ ` ^ = | < > ×. For example, given the text "[Health] During sleep at night, the body shows these 'performances'; it may be that the body has a disease.", matching it against the character expression finds the symbols "[ ]", ",", the quotation marks and "!", and after these are deleted the processed first text is: Health During sleep at night the body shows these performances it may be that the body has a disease.
More specifically, each character of the obtained first text is traversed and matched against a letter-conversion expression; if the matched character is an upper-case letter, it is converted to a lower-case character, and after all characters have been matched, the matched first text is taken as the second text. The letter-conversion expression refers to a regular expression dedicated to matching the upper-case letters in the first text and converting them to lower case, for example a substitution of the form $reg = '/(\w+)/e'. Then the second text can be imported into a preset conversion library for half-width conversion to obtain the converted target text, where the preset conversion library refers to a database for identifying the full-width characters in the second text and converting them to half-width characters; regex matching may be used for this, or a preset script.
In this way, by deleting punctuation from the text, converting upper-case letters to lower case and converting full-width characters to half-width to obtain a target text, the processing speed of the computer can be enhanced.
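The three steps S301-S303 can be sketched in a few lines. Since the regular expressions printed above are only partially legible, this sketch uses Unicode character categories for the punctuation test and NFKC normalization for the full-width-to-half-width conversion; these are stand-in techniques, not the patent's exact expressions.

```python
import unicodedata

def preprocess(text):
    """Apply the three cleaning steps S301-S303 in order."""
    # S301: strip punctuation (any Unicode category starting with 'P').
    no_punct = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # S302: fold upper-case letters to lower case.
    lowered = no_punct.lower()
    # S303: convert full-width characters to half-width (NFKC folds
    # full-width forms such as 'Ａ' to their ASCII equivalents).
    return unicodedata.normalize("NFKC", lowered)

target = preprocess("【Health】 Sleep At Night！")   # -> "health sleep at night"
```

Doing the punctuation test by Unicode category rather than an explicit symbol list has the advantage of covering both Chinese full-width punctuation (【】, ！, ，) and ASCII punctuation with one rule.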
Further, as shown in Fig. 4, step S201 further includes:
S401: performing word segmentation on the target text with a segmenter to obtain a segmentation result; and
S402: forming the segmentation result into the feature text.
In the embodiments of the present invention, the target text may be imported into the jieba segmenter and a segmentation mode selected; the segmentation modes may include full mode, accurate mode, new-word identification, search-engine mode and so on, where new-word identification allows new words to be added in a customized way. The present embodiment preferably uses accurate mode. For example, if the target text is "The famous sites of the Forbidden City include the Palace of Heavenly Purity, the Hall of Supreme Harmony and yellow glazed tiles, etc.", accurate-mode segmentation yields the result: Forbidden City / famous sites / include / Palace of Heavenly Purity / Hall of Supreme Harmony / and / yellow / glazed tiles / etc. The segmentation result thus obtained can serve as a feature text.
In this way, segmenting the target text with a segmenter generates a feature text corresponding to the target text, which facilitates the encoder's encoding of the feature text.
Further, as shown in Fig. 5, after S401 the method comprises the steps of:
S501: detecting, with a preset stop-word lexicon, whether stop words exist in the segmentation result;
S502: if so, deleting the stop words.
In the embodiments of the present invention, all stop words may be obtained from the preset stop-word lexicon, and each segmented word in the segmentation result compared with the stop words. When a word in the segmentation result is matched as identical to at least one stop word, the identical words are deleted, and the segmentation result after deletion is taken as the segmentation result representing the text features; when no stop words are detected in the segmentation result, it can be used directly as the segmentation result representing the text features. The preset stop-word lexicon refers to a database for storing stop words.
In this way, by comparing the segmentation result with the stop words in the lexicon and deleting the stop words that appear in it, a segmentation result with stronger text features can be obtained.
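The S501/S502 filtering amounts to a set-membership test per token. A minimal sketch, with an illustrative stop-word set (a real lexicon would be loaded from the stop-word database the text describes):

```python
# Illustrative stop-word lexicon; in practice this would be loaded
# from the preset stop-word database described in the text.
STOP_WORDS = {"the", "and", "of", "etc"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that do not appear in the stop-word lexicon."""
    return [tok for tok in tokens if tok not in stop_words]

filtered = remove_stop_words(["palace", "and", "the", "hall", "of", "harmony"])
# -> ["palace", "hall", "harmony"]
```

Using a set rather than a list for the lexicon makes each membership test O(1), which matters when the lexicon holds thousands of stop words.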
Further, as shown in Fig. 6, step S203 includes:
S601: reducing the dimensionality of the first feature vector with the weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vector at the hidden layer.
In the embodiments of the present invention, the weight matrix realizes the dimensionality reduction of the first feature vector and may be preset between the input layer and the hidden layer of the second encoder. Reducing the dimensionality of the first feature vector yields the second feature vector, which can be expressed in matrix form: the word vectors have a dimensionality (the number of columns in the matrix) and a word count (the number of word vectors, i.e. the number of words in the dictionary, which is the number of rows in the matrix).
In this way, by projecting the feature vector through the weight matrix for dimensionality reduction, the curse of dimensionality and the amount of calculation can be reduced.
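The projection in S601 has a convenient special case: multiplying a one-hot vector by the vocab-size × hidden-size weight matrix W simply selects the row of W at the hot index, which is why this step is effectively an embedding lookup. A sketch with illustrative numbers:

```python
# Sketch of S601: one-hot first feature vector times the
# input-to-hidden weight matrix W (vocab_size x hidden_size).

def reduce_dimension(one_hot_vec, W):
    """Project a vector through W; for a one-hot input this is row selection."""
    hidden_size = len(W[0])
    return [sum(x * W[i][j] for i, x in enumerate(one_hot_vec))
            for j in range(hidden_size)]

W = [[0.1, 0.2],   # row 0: embedding of word 0
     [0.3, 0.4],   # row 1: embedding of word 1
     [0.5, 0.6]]   # row 2: embedding of word 2

second_feature = reduce_dimension([0, 1, 0], W)   # selects row 1
```

Here a 3-dimensional one-hot input is reduced to a 2-dimensional hidden vector; in practice the reduction is from vocabulary size (tens of thousands) to a hidden size of a few hundred.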
Further, as shown in Fig. 7, S204 comprises the steps of:
S701: inputting the second feature vector and the classification label into a denoising autoencoder, and randomly damaging the second feature vector to obtain a third feature vector;
S702: training the denoising autoencoder based on the third feature vector.
In the embodiments of the present invention, the second feature vector (word vector) and the classification label are input into the denoising autoencoder for processing, so that the word vector is polluted or randomly damaged; the polluted or randomly damaged word vector serves as the third feature vector, and the denoising autoencoder is then trained on this third feature vector.
In this way, the target encoding network is trained on word vectors that have been damaged or polluted, so the resulting text vectors are more stable; this increases the robustness of the denoising autoencoder and the encoding network, thereby improving the robustness of the entire target encoding network.
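The "random damage" step of S701 is commonly implemented as masking noise: each component of the second feature vector is zeroed independently with some probability. The drop probability and vector values below are illustrative; the patent does not specify the corruption scheme beyond "random damage or pollution".

```python
import random

def corrupt(vec, drop_prob, rng):
    """Zero each component independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in vec]

rng = random.Random(0)   # fixed seed so the damage is reproducible here
third_feature = corrupt([0.5, -1.2, 0.8, 0.3], drop_prob=0.5, rng=rng)
```

The autoencoder is then trained to reconstruct the clean vector from `third_feature`, which is what forces the hidden representation to be stable under damage.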
Further, as shown in Fig. 8, step S204 further includes:
S801: calculating, by classification label, the inner product between the text vectors of each text;
S802: comparing the inner-product results of each text to obtain the similarity of each text;
S803: forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
In the embodiments of the present invention, the classification label can give the sentence vector a text-type attribute, and the inner product between texts can be used to indicate the similarity between them. For example, for three texts x_0, x_1, x_2, where x_0 and x_1 are of the same type and x_0 and x_2 are of different types, the corresponding feature vectors (text vectors) at the hidden layer of the third encoder are h_0, h_1, h_2 respectively, and the weights are adjusted through training so that the similarity of h_0 and h_1 is greater than the similarity of h_0 and h_2:
Sim(h_0, h_1) > Sim(h_0, h_2)
That is, taking x_0 as the target text, x_1 is a text of the same type as x_0 and x_2 is a text of a different type from x_0; for each text x, at least one same-type and one different-type text is found. For example, x_0 is in jpg format, x_1 is in jpg format, and x_2 is in docx format.
In this way, the sentence vector acquires a text-type attribute according to the classification label; classification is then completed by calculating the similarity between each text and the target text, finding texts of the same type as and of different types from the target text, so that the characterization ability of the text is enhanced.
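The Sim(h_0, h_1) > Sim(h_0, h_2) constraint above can be checked directly, since Sim(·) is just the inner product of hidden-layer vectors. The vectors below are illustrative values standing in for trained hidden-layer outputs:

```python
# Sketch of the similarity check the training drives toward:
# Sim() is the inner product, and training adjusts weights until
# Sim(h0, h1) > Sim(h0, h2) for the same-type pair (h0, h1) and
# the different-type pair (h0, h2).

def sim(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

h0 = [0.9, 0.1, 0.2]   # target text
h1 = [0.8, 0.2, 0.1]   # same-type text
h2 = [0.1, 0.9, 0.7]   # different-type text

same_type = sim(h0, h1)   # ≈ 0.76
diff_type = sim(h0, h2)   # ≈ 0.32
```

During training, a batch that violates `same_type > diff_type` contributes a positive similarity-loss term, nudging the hidden vectors of same-type texts toward each other.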
In the present invention, the target text is formed by removing punctuation marks from the text, converting uppercase letters to lowercase, and converting full-width characters to half-width; the target text is then subjected to word segmentation and stop-word deletion to form the feature text. The first encoder encodes the feature text into a multi-dimensional one-hot vector space and outputs the first feature vector. The neural network of the second encoder then inputs the first feature vector into the weight matrix of the hidden layer for dimension reduction and outputs the corresponding second feature vector (word vector). The second feature vector and the classification label are then input into the denoising autoencoder for training: the second feature vector is randomly corrupted (polluted), the inner products between texts are calculated, and the inner product results are compared, so that the trained model satisfies the condition that same-type text similarity is greater than different-type text similarity and the resulting texts carry a text-type attribute. Afterwards, the acquired text to be processed is subjected to text processing and word segmentation and input into the trained denoising autoencoder for encoding, so that the corresponding sentence vectors have a text-type attribute, the text vectors composed of the corresponding word vectors are more stable, and the representation ability of the text vectors composed of word vectors is enhanced.
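As a minimal sketch of the first-encoder stage described above, the following assumes a toy vocabulary and maps each token of the feature text to a one-hot row vector in a multi-dimensional one-hot vector space (vocabulary and tokens are illustrative, not from the patent):

```python
import numpy as np

def one_hot_encode(tokens, vocab):
    """First-encoder sketch: each token becomes a one-hot row vector
    over the given (assumed) vocabulary."""
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        mat[row, index[tok]] = 1.0
    return mat

vocab = ["text", "vector", "encoder"]            # illustrative vocabulary
first_feature = one_hot_encode(["vector", "text"], vocab)
print(first_feature)   # one 1.0 per row, at the token's vocabulary index
```

Each row of `first_feature` is sparse and high-dimensional, which is why the second encoder's hidden-layer weight matrix is used to reduce it to a dense word vector.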
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of each of the above methods. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
It should be understood that although the steps in the flowcharts of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the flowcharts of the accompanying drawings may include multiple sub-steps or stages; these sub-steps or stages are not necessarily completed at the same time but can be executed at different times, and their execution order is not necessarily sequential: they can be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
As shown in Figure 9, which is a structural schematic diagram of a text vector acquisition device provided by this embodiment, the device 900 includes: a processing module 901, a first encoding module 902, a second encoding module 903, a training module 904, and an input module 905. Wherein:
The processing module 901 is configured to perform text processing on at least two different types of texts to obtain a target text, and to perform word segmentation on the target text to obtain a corresponding feature text, wherein the texts include classification labels and text content;
The first encoding module 902 is configured to encode the feature text into a multi-dimensional one-hot vector space through a preset first encoder to obtain a first feature vector of the feature text;
The second encoding module 903 is configured to encode the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector;
The training module 904 is configured to input the second feature vector and the classification label into a third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
The input module 905 is configured to acquire a text to be processed, perform the text processing and the word segmentation on the text to be processed, and then input it into the target encoding network to obtain a text vector of the text to be processed.
Further, as shown in Figure 10, which is a structural schematic diagram of a specific embodiment of the processing module 901, the processing module comprises: a first processing submodule 9011, a second processing submodule 9012, and a third processing submodule 9013. Wherein,
The first processing submodule 9011 is configured to remove punctuation marks from the text to obtain a first text;
The second processing submodule 9012 is configured to convert uppercase letters in the first text to lowercase to obtain a second text;
The third processing submodule 9013 is configured to convert full-width characters in the second text to half-width to obtain the target text.
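A minimal sketch of the three processing submodules, using Python's standard `re` and `unicodedata` modules; NFKC normalization is assumed here as one way to fold full-width characters to half-width:

```python
import re
import unicodedata

def remove_punctuation(text: str) -> str:
    # First processing submodule: strip punctuation marks.
    return re.sub(r"[^\w\s]", "", text)

def to_lowercase(text: str) -> str:
    # Second processing submodule: uppercase -> lowercase.
    return text.lower()

def full_to_half_width(text: str) -> str:
    # Third processing submodule: full-width -> half-width via NFKC folding.
    return unicodedata.normalize("NFKC", text)

target_text = full_to_half_width(to_lowercase(remove_punctuation("Ｈｅｌｌｏ, World!")))
print(target_text)   # hello world
```

The three steps compose in the order the submodules are described: punctuation removal, case folding, then width folding.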
Further, as shown in Figure 11, which is a structural schematic diagram of another specific embodiment of the processing module 901, the processing module further comprises: a segmentation submodule 9014 and a first generating submodule 9015. Wherein,
The segmentation submodule 9014 is configured to perform word segmentation on the target text through a segmenter to obtain a segmentation result; and
The first generating submodule 9015 is configured to form the feature text from the segmentation result.
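A minimal sketch of the segmentation submodule and the first generating submodule; a whitespace split stands in for a real segmenter so the sketch stays self-contained (for Chinese text, a segmenter such as jieba would typically be used):

```python
def segment(target_text: str):
    """Segmentation-submodule stand-in: a real implementation would call
    a word segmenter (e.g. jieba for Chinese); whitespace splitting is
    used here as an illustrative placeholder."""
    return target_text.split()

def form_feature_text(words):
    # First generating submodule: join the segmentation result into the
    # feature text consumed by the first encoder.
    return " ".join(words)

words = segment("text vector acquisition method")
print(form_feature_text(words))
```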
Further, as shown in Figure 12, which is a structural schematic diagram of yet another specific embodiment of the processing module 901, the processing module further comprises: a detection submodule 9016 and a deletion submodule 9017. Wherein,
The detection submodule 9016 is configured to detect, through a preset stop-word dictionary, whether stop words exist in the segmentation result;
The deletion submodule 9017 is configured to delete the stop words if they exist.
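The detection and deletion submodules can be sketched as follows, with a small hypothetical stop-word set standing in for a preset stop-word dictionary loaded from file:

```python
# Hypothetical stop-word dictionary; in practice a preset stop-word list
# would be loaded from a file.
STOPWORDS = {"the", "of", "and", "a"}

def detect_stopwords(words):
    # Detection submodule: which segmented words appear in the stop list?
    return [w for w in words if w in STOPWORDS]

def delete_stopwords(words):
    # Deletion submodule: drop any detected stop words.
    return [w for w in words if w not in STOPWORDS]

words = ["the", "vector", "of", "a", "text"]
print(delete_stopwords(words))   # stop words removed
```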
Further, the second encoding module 903 is also configured to perform dimension reduction on the first feature vector through a weight matrix preset from the input layer to the hidden layer in the second encoder, obtaining the second feature vector of the hidden layer.
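A minimal sketch of this dimension reduction: multiplying a one-hot first feature vector by the input-to-hidden weight matrix simply selects a row of that matrix, and that row is the dense hidden-layer word vector (the matrix values below are random placeholders, not trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 6, 3
# Input-layer-to-hidden-layer weight matrix of the second encoder
# (illustrative random values; learned during training in practice).
W = rng.standard_normal((vocab_size, embed_dim))

# One-hot first feature vectors, one row per token.
first_feature = np.eye(vocab_size)[[2, 4]]

# Dimension reduction: a one-hot row times W picks out one row of W.
second_feature = first_feature @ W
assert np.allclose(second_feature, W[[2, 4]])
print(second_feature.shape)   # (2, 3): dense hidden-layer word vectors
```

This is why the hidden-layer weight matrix can be read directly as a word-vector lookup table.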
Further, as shown in Figure 13, which is a structural schematic diagram of a specific embodiment of the training module 904, the training module comprises: an input submodule 9041 and a training submodule 9042. Wherein,
The input submodule 9041 is configured to input the second feature vector and the classification label into a denoising autoencoder and to randomly corrupt the second feature vector to obtain a third feature vector;
The training submodule 9042 is configured to train the denoising autoencoder based on the third feature vector.
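One common way to randomly corrupt an input for a denoising autoencoder is masking noise, i.e. zeroing components at random; the sketch below assumes that interpretation, since the text does not fix a specific corruption scheme:

```python
import numpy as np

def corrupt(second_feature, drop_prob=0.3, rng=None):
    """Input-submodule sketch: randomly damage the second feature vector
    by zeroing components (masking noise), yielding the third feature
    vector that the denoising autoencoder learns to reconstruct from."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(second_feature.shape) >= drop_prob
    return second_feature * mask

x = np.ones((2, 4))          # illustrative second feature vectors
x_noisy = corrupt(x)         # third feature vectors
print(x_noisy)               # some entries zeroed at random
```

The autoencoder is then trained to reconstruct `x` from `x_noisy`, which forces the hidden representation to be robust to the corruption.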
Further, as shown in Figure 14, which is a structural schematic diagram of another specific embodiment of the training module 904, the training module comprises: a calculation submodule 9043, a comparison submodule 9044, and a second generating submodule 9045. Wherein,
The calculation submodule 9043 is configured to calculate the inner products between the text vectors of the texts according to the classification labels;
The comparison submodule 9044 is configured to compare the inner product results of the texts to obtain the similarity of each text;
The second generating submodule 9045 is configured to form the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder, and the third encoder.
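A minimal sketch of the calculation and comparison submodules: pairwise inner products are computed and, per the classification labels, same-type similarities are checked to exceed different-type similarities (vectors and labels are illustrative):

```python
import numpy as np

def pairwise_inner_products(vectors):
    # Calculation submodule: inner products between every pair of text vectors.
    return vectors @ vectors.T

def same_type_closer(vectors, labels, anchor=0):
    """Comparison-submodule sketch: is the anchor text more similar (larger
    inner product) to every same-label text than to any different-label one?"""
    sims = pairwise_inner_products(vectors)[anchor]
    same = [sims[i] for i, l in enumerate(labels) if l == labels[anchor] and i != anchor]
    diff = [sims[i] for i, l in enumerate(labels) if l != labels[anchor]]
    return bool(min(same) > max(diff))

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = ["jpg", "jpg", "docx"]
print(same_type_closer(vectors, labels))   # True
```

When this check holds for the hidden-layer vectors, the trained first, second, and third encoders together form the target encoding network.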
The text vector acquisition device provided by the embodiments of the present application can realize each of the embodiments of the methods of Fig. 2 to Fig. 8 and their corresponding beneficial effects; to avoid repetition, they are not described here again.
In order to solve the above technical problems, the embodiments of the present application also provide a computer equipment. Referring specifically to Figure 15, Figure 15 is a block diagram of the basic structure of the computer equipment of this embodiment.
The computer equipment 15 includes a memory 151, a processor 152, and a network interface 153 that communicate with each other through a system bus. It should be noted that the figure only shows the computer equipment 15 with components 151-153; it should be understood that not all of the components shown are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer equipment here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer equipment can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer equipment can perform human-computer interaction with a client through a keyboard, a mouse, a remote controller, a touch pad, a voice-control device, or the like.
The memory 151 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 151 can be an internal storage unit of the computer equipment 15, such as the hard disk or memory of the computer equipment 15. In other embodiments, the memory 151 can also be an external storage device of the computer equipment 15, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer equipment 15. Of course, the memory 151 can also include both the internal storage unit of the computer equipment 15 and its external storage device. In this embodiment, the memory 151 is generally used to store the operating system and various types of application software installed on the computer equipment 15, such as the program code of the text vector acquisition method. In addition, the memory 151 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 152 can be, in some embodiments, a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 152 is generally used to control the overall operation of the computer equipment 15. In this embodiment, the processor 152 is used to run the program code or process the data stored in the memory 151, such as running the program code of the text vector acquisition method.
The network interface 153 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the computer equipment 15 and other electronic devices.
The present application also provides another embodiment, namely a computer-readable storage medium storing a text vector acquisition program, and the text vector acquisition program can be executed by at least one processor so that the at least one processor executes the steps of the text vector acquisition method described above.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by means of software plus a necessary general-purpose hardware platform; they can, of course, also be realized by hardware, but in many cases the former is the preferable embodiment. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes instructions for causing a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the text vector acquisition method of each embodiment of the present application.
The terms "comprising" and "having" in the description and claims of the present application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description and claims of the present application or in the above drawings are used to distinguish different objects, not to describe a particular order. Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment of the present application. The appearance of the phrase in various places in the description does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The above are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present invention shall all be included in the protection scope of the present invention.
Claims (10)
1. A text vector acquisition method, characterized by comprising the steps of:
performing text processing on at least two different types of texts to obtain a target text, and performing word segmentation on the target text to obtain a corresponding feature text, wherein the texts include classification labels and text content;
encoding the feature text into a multi-dimensional one-hot vector space through a preset first encoder to obtain a first feature vector of the feature text;
encoding the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector;
inputting the second feature vector and the classification label into a third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
acquiring a text to be processed, performing the text processing and the word segmentation on the text to be processed, and then inputting it into the target encoding network to obtain a text vector of the text to be processed.
2. The text vector acquisition method according to claim 1, characterized in that the step of performing text processing on at least two different types of texts to obtain the target text comprises:
removing punctuation marks from the text to obtain a first text;
converting uppercase letters in the first text to lowercase to obtain a second text;
converting full-width characters in the second text to half-width to obtain the target text.
3. The text vector acquisition method according to claim 1, characterized in that the step of performing word segmentation on the target text to obtain the corresponding feature text comprises:
performing word segmentation on the target text through a segmenter to obtain a segmentation result; and
forming the feature text from the segmentation result.
4. The text vector acquisition method according to claim 3, characterized in that, after obtaining the segmentation result, the method comprises the steps of:
detecting, through a preset stop-word dictionary, whether stop words exist in the segmentation result;
if so, deleting the stop words.
5. The text vector acquisition method according to claim 1, characterized in that the step of encoding the first feature vector into the word vector space through the preset second encoder to obtain the second feature vector of the first feature vector comprises:
performing dimension reduction on the first feature vector through a weight matrix preset from the input layer to the hidden layer in the second encoder to obtain the second feature vector of the hidden layer.
6. The text vector acquisition method according to claim 1, characterized in that inputting the second feature vector and the classification label into the third encoder and training the third encoder comprises the steps of:
inputting the second feature vector and the classification label into a denoising autoencoder, and randomly corrupting the second feature vector to obtain a third feature vector;
training the denoising autoencoder based on the third feature vector.
7. The text vector acquisition method according to claim 1, characterized in that the step of iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining the target encoding network, comprises:
calculating the inner products between the text vectors of the texts according to the classification labels;
comparing the inner product results of the texts to obtain the similarity of each text;
forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder, and the third encoder.
8. A text vector acquisition device, characterized by comprising:
a processing module, configured to perform text processing on at least two different types of texts to obtain a target text, and to perform word segmentation on the target text to obtain a corresponding feature text, wherein the texts include classification labels and text content;
a first encoding module, configured to encode the feature text into a multi-dimensional one-hot vector space through a preset first encoder to obtain a first feature vector of the feature text;
a second encoding module, configured to encode the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector;
a training module, configured to input the second feature vector and the classification label into a third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
an input module, configured to acquire a text to be processed, perform the text processing and the word segmentation on the text to be processed, and then input it into the target encoding network to obtain a text vector of the text to be processed.
9. A computer equipment, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, realizes the steps of the text vector acquisition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, realizes the steps of the text vector acquisition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910637101.2A CN110532381B (en) | 2019-07-15 | 2019-07-15 | Text vector acquisition method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110532381A true CN110532381A (en) | 2019-12-03 |
CN110532381B CN110532381B (en) | 2023-09-26 |
Family
ID=68660195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910637101.2A Active CN110532381B (en) | 2019-07-15 | 2019-07-15 | Text vector acquisition method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110532381B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180121801A1 (en) * | 2016-10-28 | 2018-05-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for classifying questions based on artificial intelligence |
CN109408702A (en) * | 2018-08-29 | 2019-03-01 | 昆明理工大学 | A kind of mixed recommendation method based on sparse edge noise reduction autocoding |
CN109582786A (en) * | 2018-10-31 | 2019-04-05 | 中国科学院深圳先进技术研究院 | A kind of text representation learning method, system and electronic equipment based on autocoding |
CN109885826A (en) * | 2019-01-07 | 2019-06-14 | 平安科技(深圳)有限公司 | Text term vector acquisition methods, device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
张素智 (ZHANG Suzhi) et al.: "Research on Feature Extraction with Clustering-Oriented Stacked Denoising Autoencoders", Modern Computer (《现代计算机》) *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079442A (en) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
CN111079442B (en) * | 2019-12-20 | 2021-05-18 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
US11403468B2 (en) | 2019-12-20 | 2022-08-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating vector representation of text, and related computer device |
WO2021134416A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Text transformation method and apparatus, computer device, and computer readable storage medium |
WO2021143020A1 (en) * | 2020-01-14 | 2021-07-22 | 平安科技(深圳)有限公司 | Bad term recognition method and device, electronic device, and storage medium |
CN111445545A (en) * | 2020-02-27 | 2020-07-24 | 北京大米未来科技有限公司 | Text-to-map method, device, storage medium and electronic equipment |
CN111445545B (en) * | 2020-02-27 | 2023-08-18 | 北京大米未来科技有限公司 | Text transfer mapping method and device, storage medium and electronic equipment |
CN110990837B (en) * | 2020-02-29 | 2023-03-24 | 网御安全技术(深圳)有限公司 | System call behavior sequence dimension reduction method, system, equipment and storage medium |
CN110990837A (en) * | 2020-02-29 | 2020-04-10 | 网御安全技术(深圳)有限公司 | System call behavior sequence dimension reduction method, system, equipment and storage medium |
CN112528681A (en) * | 2020-12-18 | 2021-03-19 | 北京百度网讯科技有限公司 | Cross-language retrieval and model training method, device, equipment and storage medium |
CN112749530A (en) * | 2021-01-11 | 2021-05-04 | 北京光速斑马数据科技有限公司 | Text encoding method, device, equipment and computer readable storage medium |
CN112749530B (en) * | 2021-01-11 | 2023-12-19 | 北京光速斑马数据科技有限公司 | Text encoding method, apparatus, device and computer readable storage medium |
CN115047894A (en) * | 2022-04-14 | 2022-09-13 | 中国民用航空总局第二研究所 | Unmanned aerial vehicle track measuring and calculating method, electronic equipment and storage medium |
CN115047894B (en) * | 2022-04-14 | 2023-09-15 | 中国民用航空总局第二研究所 | Unmanned aerial vehicle track measuring and calculating method, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110532381B (en) | 2023-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110532381A (en) | A kind of text vector acquisition methods, device, computer equipment and storage medium | |
CN108959246B (en) | Answer selection method and device based on improved attention mechanism and electronic equipment | |
CN110909548A (en) | Chinese named entity recognition method and device and computer readable storage medium | |
CN104834747A (en) | Short text classification method based on convolution neutral network | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN110866098B (en) | Machine reading method and device based on transformer and lstm and readable storage medium | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
WO2020056977A1 (en) | Knowledge point pushing method and device, and computer readable storage medium | |
CN110738049B (en) | Similar text processing method and device and computer readable storage medium | |
CN111737997A (en) | Text similarity determination method, text similarity determination equipment and storage medium | |
WO2021051934A1 (en) | Method and apparatus for extracting key contract term on basis of artificial intelligence, and storage medium | |
CN113505601A (en) | Positive and negative sample pair construction method and device, computer equipment and storage medium | |
CN112084342A (en) | Test question generation method and device, computer equipment and storage medium | |
CN111488732A (en) | Deformed keyword detection method, system and related equipment | |
CN112329463A (en) | Training method of remote monitoring relation extraction model and related device | |
CN110222144B (en) | Text content extraction method and device, electronic equipment and storage medium | |
CN110019674A (en) | A kind of text plagiarizes detection method and system | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
Kokane et al. | Word sense disambiguation: a supervised semantic similarity based complex network approach | |
CN110363206A (en) | Cluster, data processing and the data identification method of data object | |
CN110321565B (en) | Real-time text emotion analysis method, device and equipment based on deep learning | |
CN114398903B (en) | Intention recognition method, device, electronic equipment and storage medium | |
Nguyen et al. | A feature-word-topic model for image annotation | |
CN114722774B (en) | Data compression method, device, electronic equipment and storage medium | |
JP7236501B2 (en) | Transfer learning method and computer device for deep learning model based on document similarity learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||