CN110532381A - Text vector acquisition method, apparatus, computer device and storage medium - Google Patents

Text vector acquisition method, apparatus, computer device and storage medium

Info

Publication number
CN110532381A
CN110532381A (application CN201910637101.2A)
Authority
CN
China
Prior art keywords
text
vector
encoder
feature
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910637101.2A
Other languages
Chinese (zh)
Other versions
CN110532381B (en)
Inventor
唐亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN201910637101.2A priority Critical patent/CN110532381B/en
Publication of CN110532381A publication Critical patent/CN110532381A/en
Application granted granted Critical
Publication of CN110532381B publication Critical patent/CN110532381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Abstract

The present invention is applicable to the field of artificial intelligence, and provides a text vector acquisition method, apparatus, computer device and storage medium. The method includes: performing text processing on texts to obtain target texts, and performing word segmentation on the target texts to obtain corresponding feature texts; encoding the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain first feature vectors; encoding the first feature vectors into a word vector space through a preset second encoder to obtain second feature vectors; inputting the second feature vectors and the classification labels into a third encoder and iterating its loss function so that the hidden-layer vectors satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network; and inputting a text to be processed, after processing, into the target encoding network to obtain the text vector of the text to be processed. The present application can enhance the characterization ability of text vectors.

Description

Text vector acquisition method, apparatus, computer device and storage medium
Technical field
The present invention belongs to the field of artificial intelligence, and more particularly relates to a text vector acquisition method, apparatus, computer device and storage medium.
Background technique
At present, natural language processing is an important direction in the fields of computer science and artificial intelligence. With the rapid development of natural language processing technology, basic research in this field has received increasing attention, including research on how to generate text vectors. Tasks such as text classification, text clustering and similarity calculation all require texts to be transformed into vectors in advance, so that mathematical operations and statistics can be performed on the vectors instead of the original texts, turning natural language into data a computer can recognize and enabling people to communicate with computers in natural language. In existing natural language processing, the Sentence2Vec (sentence vector model) method puts the text contents of all text types together as a training corpus after text processing, and text features are obtained by having Word2vec (word vector model) output word vectors and averaging them, with the averaged result used directly as the text vector. As can be seen, existing text vector acquisition techniques suffer from a low characterization ability of text.
Summary of the invention
The embodiments of the present invention provide a text vector acquisition method, which aims to solve the problem of low text characterization ability in existing text vector acquisition techniques.
An embodiment of the present invention is implemented as follows: a text vector acquisition method is provided, comprising the steps of:
performing word segmentation on at least two different types of target texts that have undergone text processing to obtain corresponding feature texts, wherein a text includes a classification label and text content;
encoding the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain first feature vectors of the feature texts;
encoding the first feature vectors into a word vector space through a preset second encoder to obtain second feature vectors of the first feature vectors;
inputting the second feature vectors and the classification labels into a third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
obtaining a text to be processed, and inputting the text to be processed into the target encoding network after text processing and word segmentation, to obtain the text vector of the text to be processed.
Further, the step of performing text processing on the at least two different types of texts to obtain the target texts includes:
removing punctuation marks from a text to obtain a first text;
converting uppercase letters in the first text to lowercase to obtain a second text;
converting full-width characters in the second text to half-width to obtain a target text.
Further, the step of performing word segmentation on a target text to obtain a corresponding feature text includes:
performing word segmentation on the target text through a tokenizer to obtain a segmentation result; and
forming the segmentation result into a feature text.
Further, after obtaining the segmentation result, the method comprises the steps of:
detecting, through a preset stop-word lexicon, whether stop words exist in the segmentation result;
if so, deleting the stop words.
Further, the step of encoding the first feature vectors into a word vector space through the preset second encoder to obtain the second feature vectors of the first feature vectors includes:
reducing the dimensionality of the first feature vectors through a weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vectors at the hidden layer.
Further, the step of inputting the second feature vectors and the classification labels into the third encoder and training the third encoder comprises the steps of:
inputting the second feature vectors and the classification labels into a denoising autoencoder, and randomly damaging the second feature vectors to obtain third feature vectors;
training the denoising autoencoder based on the third feature vectors.
Further, the step of iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining the target encoding network, includes:
calculating, by means of the classification labels, the inner product between the text vectors of each pair of texts;
comparing the inner-product results of the texts to obtain the similarity of each text;
forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
The present invention also provides a text vector acquisition apparatus, comprising:
a processing module, configured to perform text processing on at least two different types of texts to obtain target texts, and to perform word segmentation on the target texts to obtain corresponding feature texts, wherein a text includes a classification label and text content;
a first encoding module, configured to encode the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain first feature vectors of the feature texts;
a second encoding module, configured to encode the first feature vectors into a word vector space through a preset second encoder to obtain second feature vectors of the first feature vectors;
a training module, configured to input the second feature vectors and the classification labels into a third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
an input module, configured to obtain a text to be processed, and input the text to be processed into the target encoding network after the text processing and word segmentation, to obtain the text vector of the text to be processed.
The present invention also provides a computer device, including a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the text vector acquisition method according to any one of claims 1 to 7.
The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text vector acquisition method according to any one of claims 1 to 7.
The beneficial effects achieved by the present invention are as follows: word segmentation is performed on the target texts; based on the first encoder and the second encoder, the feature texts are encoded to obtain the first feature vectors and the corresponding second feature vectors (word vectors); the second feature vectors are input together with the classification labels into the third encoder for training, the second feature vectors being allowed to be damaged or polluted; and the model is trained until same-type text similarity is greater than different-type text similarity, so that the resulting text vectors are more stable and the characterization ability of the text vectors composed of the word vectors is enhanced.
Detailed description of the invention
Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
Fig. 2 is a flow chart of an embodiment of the text vector acquisition method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of a specific embodiment of S201 in Fig. 2;
Fig. 4 is a flow chart of another specific embodiment of S201 in Fig. 2;
Fig. 5 is a flow chart of a specific embodiment of S401 in Fig. 4;
Fig. 6 is a flow chart of a specific embodiment of S203 in Fig. 2;
Fig. 7 is a flow chart of a specific embodiment of S204 in Fig. 2;
Fig. 8 is a flow chart of another specific embodiment of S204 in Fig. 2;
Fig. 9 is a schematic structural diagram of the text vector acquisition apparatus provided by an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a specific embodiment of the processing module shown in Fig. 9;
Fig. 11 is a schematic structural diagram of another specific embodiment of the processing module shown in Fig. 9;
Fig. 12 is a schematic structural diagram of yet another specific embodiment of the processing module shown in Fig. 9;
Fig. 13 is a schematic structural diagram of a specific embodiment of the training module shown in Fig. 9;
Fig. 14 is a schematic structural diagram of another specific embodiment of the training module shown in Fig. 9;
Fig. 15 is a schematic structural diagram of an embodiment of the computer device of the present application.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
In the present invention, word segmentation is performed on the target texts; based on the first encoder and the second encoder, the feature texts are encoded to obtain first feature vectors and corresponding second feature vectors (word vectors); the second feature vectors are input together with the classification labels into the third encoder for training, the second feature vectors being allowed to be damaged or polluted; and the model is trained until same-type text similarity is greater than different-type text similarity, so that the resulting text vectors are more stable and the characterization ability of the text vectors composed of the word vectors is enhanced.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the server 105 and the terminal devices 101, 102 and 103, and may include various connection types, such as wired or wireless communication links or fiber-optic cables. The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen, can download application software and can display text, including but not limited to smartphones, tablet computers, laptop portable computers and desktop computers. The server 105 may be a server providing various services, for example a background server supporting the pages displayed on the terminal devices 101, 102 and 103. A client can use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or obtain information.
It should be noted that the text vector acquisition method provided by the embodiments of the present application may be executed by the server or by the terminal device; correspondingly, the text vector acquisition apparatus may be provided in the server or in the terminal device.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; any number of terminal devices, networks and servers may be provided according to implementation needs.
As shown in Fig. 2, which is a flow chart of an embodiment of the text vector acquisition method of the present application, the method comprises the steps of:
S201: performing word segmentation on at least two different types of target texts that have undergone text processing to obtain corresponding feature texts, wherein a text includes a classification label and text content.
In this embodiment, the text vector acquisition method runs on an electronic device (such as the mobile terminal shown in Fig. 1). The texts may be texts under information categories, for example texts under columns such as sports news or entertainment news in a news category; they may also be texts under content categories, such as texts in different boards of a forum; of course, they may also be texts under insurance categories or other texts formed in natural language, which is not limited in the embodiments of the present invention. The classification label can be used to indicate the type (also called the category) of a text, and the text content can refer to the textual information formed in natural language.
Specifically, the above text processing can be understood as processing the text content of a text: converting the text content into natural language text convenient for computer processing, and, in order to increase the processing speed of the computer, applying corresponding processing to the format and characters of the content, for example converting the text format into TXT format, removing irrelevant words, and so on. Word segmentation of the target text can be performed with a segmentation tool; segmenting a text yields multiple phrases, and the corresponding feature text includes these phrases.
S202: encoding the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain the first feature vectors of the feature texts.
In this embodiment, the encoding rule of the first encoder may be user-defined or may adopt a publicly available encoding rule. The first encoder may be a one-hot encoder, which encodes the segmented words in a feature text into one-hot vectors to obtain the first feature vectors. Encoding the feature text into a multidimensional one-hot vector space allows the feature text to be encoded quickly.
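As an illustrative, non-limiting sketch of this step in Python (the vocabulary, tokens and function name are assumptions for illustration, not part of the original disclosure):

```python
# A minimal sketch of the first encoder: mapping a segmented feature text into
# a multidimensional one-hot vector space.
import numpy as np

def one_hot_encode(tokens, vocab):
    """Return a (len(tokens), len(vocab)) matrix of one-hot row vectors."""
    index = {word: i for i, word in enumerate(vocab)}
    vectors = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for row, token in enumerate(tokens):
        if token in index:                 # unknown tokens stay all-zero
            vectors[row, index[token]] = 1.0
    return vectors

vocab = ["故宫", "的", "著名景点", "包括", "乾清宫", "太和殿"]
first_feature_vectors = one_hot_encode(["故宫", "包括", "乾清宫"], vocab)
```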
S203: encoding the first feature vectors into a word vector space through a preset second encoder to obtain the second feature vectors of the first feature vectors.
In the embodiments of the present invention, following the approach of deep learning, texts are converted into vectors in a multidimensional vector space for calculation. The second encoder may be word2vec: the first feature vectors of the feature texts are used as the input vectors of word2vec for encoding, which maps the first feature vectors into a word vector space and seeks a deeper feature representation of the text data. The continuous bag-of-words model in word2vec can be used to predict from the input first feature vectors, thereby obtaining the second feature vectors (word vectors).
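For illustration, a non-limiting sketch of the second encoder using the continuous bag-of-words variant of word2vec (this assumes the gensim library, version 4 or later, and an illustrative two-sentence corpus; the patent itself does not name a specific implementation):

```python
# A sketch of the second encoder: CBOW word2vec mapping words into a dense
# word-vector space.
from gensim.models import Word2Vec

corpus = [["故宫", "的", "著名景点", "包括", "乾清宫"],
          ["太和殿", "是", "故宫", "的", "著名景点"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=0)        # sg=0 selects continuous bag-of-words
second_feature_vector = model.wv["故宫"]   # dense 100-dimensional word vector
```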
S204: inputting the second feature vectors and the classification labels into a third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network.
In the embodiments of the present invention, the hidden layer may be provided without an activation function, i.e. only the feature vectors of the hidden layer are needed. The loss function of the third encoder may, for example, combine a reconstruction term and a similarity term, such as L = α·L_R(y_n, x_n) − (1 − α)·L_T(h_0, h_1, h_2), where L_R(·) is the basic squared-error loss, L_R(y_n, x_n) = ‖y_n − x_n‖², L_T(h_0, h_1, h_2) = Sim(h_0, h_1) − Sim(h_0, h_2), α is a real number between 0 and 1, and Sim(·) is the inner-product function. Through the classification labels, the calculation makes the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity. For example: X_0 and X_1 are of the same text type with a similarity of 80%, while X_2 and X_0 are of different text types with a similarity of 1%. Of course, similarity can also refer to distance; for example, Beijing and Tianjin are of the same text type, while Beijing and Xinjiang are of different text types. In this way, the characterization ability of the text vectors can be enhanced.
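A hedged sketch of such a loss in Python follows. The definitions of L_R, L_T, Sim and α are from the text above; the way they are combined below is an assumption, since the original equation is not reproduced here:

```python
# A sketch of the third encoder's loss: reconstruction plus a similarity-gap
# term. The combined form (alpha-weighted difference) is an assumption.
import torch

def third_encoder_loss(y, x, h0, h1, h2, alpha=0.5):
    recon = torch.sum((y - x) ** 2)                     # L_R: squared error
    sim_gap = torch.dot(h0, h1) - torch.dot(h0, h2)     # L_T: inner-product similarity gap
    return alpha * recon - (1.0 - alpha) * sim_gap      # widen the gap while reconstructing
```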
S205: obtaining a text to be processed, and inputting it into the target encoding network after text processing and word segmentation, to obtain the text vector of the text to be processed.
In the embodiments of the present invention, the text to be processed can be a text whose features need to be extracted from its textual information, for example a newly added text such as one newly uploaded or newly crawled by a user. Since the hidden layer of the third encoder has no activation function, the text vector of the text to be processed can be obtained from the hidden layer of the third encoder in the target encoding network: that is, the text to be processed, after text processing and word segmentation have been completed, is input into the trained, classification-capable third encoder to obtain a text vector with category attributes.
In the present invention, word segmentation is performed on the target texts; based on the neural networks of the first encoder and the second encoder, the feature texts are encoded to obtain first feature vectors and corresponding second feature vectors (word vectors); the second feature vectors are input together with the classification labels into the third encoder for training, the second feature vectors being allowed to be damaged or polluted, and the model is trained until same-type text similarity is greater than different-type text similarity; the obtained text to be processed is then input, after text processing and word segmentation, into the trained third encoder for encoding, so that the resulting text vectors are more stable and the characterization ability of the text vectors composed of word vectors is enhanced.
Further, as shown in Fig. 3, step S201 includes:
S301: removing punctuation marks from the text to obtain a first text;
S302: converting uppercase letters in the first text to lowercase to obtain a second text;
S303: converting full-width characters in the second text to half-width to obtain the target text.
Here, strings can be processed with regular expressions ("regex matching"), which use specific characters to describe the rules by which characters occur in a string, so as to match, extract or replace strings conforming to a rule; they can also be used to find, delete and substitute strings, with fast and precise searching.
Specifically, the text is matched against a regular expression, and when punctuation marks are matched in the text, the punctuation marks are deleted to obtain the first text. The regular expression here refers to a character-class pattern for matching punctuation marks in the text. For example, given the text: '[Health] During sleep at night, if the body shows these "performances", it may be that the body is suffering from a disease.', matching it against the expression identifies the symbols '[ ]', ',', '" "' and '.', and after these symbols are deleted, the processed first text is obtained: 'Health During sleep at night if the body shows these performances it may be that the body is suffering from a disease'.
More specifically, each character in the obtained first text is traversed and matched against a letter-conversion expression; if a matched character is an uppercase letter, it is converted into a lowercase character, and after all characters have been matched, the matched first text is taken as the second text. The letter-conversion expression refers to a regular expression dedicated to matching uppercase letters in the first text and converting them to lowercase. The second text can then be imported into a preset conversion library for half-angle conversion to obtain the converted target text, where the preset conversion library refers to a database for identifying full-width characters in the second text and converting them into half-width characters; regex matching or a preset script can be used for this processing.
In this way, by deleting punctuation marks from the text, converting uppercase letters to lowercase and converting full-width characters to half-width, a target text is obtained, which can increase the processing speed of the computer.
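As an illustrative, non-limiting sketch of the three processing steps in Python (the punctuation pattern is a stand-in, and NFKC normalization is one common way to fold full-width characters to half-width; neither is specified in the original):

```python
# A sketch of S301-S303: punctuation removal, uppercase-to-lowercase,
# full-width-to-half-width.
import re
import unicodedata

def preprocess(text):
    first_text = re.sub(r"[^\w\s]", "", text)                 # S301: drop punctuation
    second_text = first_text.lower()                          # S302: uppercase -> lowercase
    target_text = unicodedata.normalize("NFKC", second_text)  # S303: full-width -> half-width
    return target_text
```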
Further, as shown in Fig. 4, step S201 further includes:
S401: performing word segmentation on the target text through a tokenizer to obtain a segmentation result; and
S402: forming the segmentation result into a feature text.
In the embodiments of the present invention, the target text may be imported into the jieba tokenizer and segmented in a selected segmentation mode. Segmentation modes may include full mode, precise mode, new-word identification and search-engine mode, where new-word identification allows new words to be added in a user-defined way; this embodiment preferably uses the precise mode. For example, if the target text is 'The famous sites of the Forbidden City include the Palace of Heavenly Purity, the Hall of Supreme Harmony and the yellow glazed tiles, etc.', segmentation in precise mode yields the result: the Forbidden City / 's / famous sites / include / the Palace of Heavenly Purity / the Hall of Supreme Harmony / and / yellow / glazed tiles / etc. The segmentation result thus obtained can serve as a feature text.
In this way, segmenting the target text with a tokenizer to form the feature text corresponding to the target text facilitates the encoding of the feature text by the encoders.
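A minimal sketch of this segmentation step with the jieba tokenizer in precise mode, matching the Forbidden City example above (the sample sentence is illustrative):

```python
# Word segmentation with jieba in precise mode (cut_all=False).
import jieba

target_text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"
segments = jieba.lcut(target_text, cut_all=False)
feature_text = "/".join(segments)
# e.g. 故宫/的/著名景点/包括/乾清宫/、/太和殿/和/黄琉璃瓦/等
```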
Further, as shown in Fig. 5, after S401 the method comprises the steps of:
S501: detecting, through a preset stop-word lexicon, whether stop words exist in the segmentation result;
S502: if so, deleting the stop words.
In the embodiments of the present invention, all stop words may be obtained from the preset stop-word lexicon, and each segmented word in the segmentation result is then compared with the stop words. When a word in the segmentation result is matched as identical to at least one stop word, the matching words are deleted, and the segmentation result after deletion is taken as the segmentation result representing the text features. When no stop word is detected in the segmentation result, the segmentation result can be used directly as the segmentation result representing the text features. The preset stop-word lexicon refers to a database that can be used to store stop words.
In this way, by comparing the segmentation result with the stop words in the stop-word lexicon and deleting the stop words appearing in the segmentation result, a segmentation result that better represents the text features can be obtained.
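An illustrative sketch of the stop-word filtering step (the lexicon contents and names here are assumptions for illustration):

```python
# Filtering the segmentation result against a preset stop-word lexicon.
stop_words = {"的", "和", "、", "等"}

def remove_stop_words(segments, stop_words):
    return [word for word in segments if word not in stop_words]

filtered = remove_stop_words(["故宫", "的", "著名景点"], stop_words)  # ['故宫', '著名景点']
```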
Further, as shown in Fig. 6, step S203 includes:
S601: reducing the dimensionality of the first feature vectors through a weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vectors at the hidden layer.
In the embodiments of the present invention, the weight matrix implements the dimensionality reduction of the first feature vectors and may be preset between the input layer and the hidden layer of the second encoder. Reducing the dimensionality of the first feature vectors yields the second feature vectors, which can be represented in matrix form: the number of columns of the matrix is the number of dimensions of the word vectors, and the number of rows is the number of words, i.e. the number of word vectors in the dictionary.
In this way, by performing dimensionality reduction through the weight matrix, the curse of dimensionality can be mitigated and the amount of calculation reduced.
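As a non-limiting sketch of this step (the sizes and the word index are illustrative): multiplying a one-hot row vector by the V x d input-to-hidden weight matrix simply selects the corresponding row of the matrix, which is what performs the dimensionality reduction.

```python
# The input-to-hidden weight matrix reducing a V-dimensional one-hot vector
# to a d-dimensional word vector.
import numpy as np

V, d = 10000, 100                              # vocabulary size, word-vector dimension
W = np.random.randn(V, d).astype(np.float32)   # input-to-hidden weights

one_hot = np.zeros(V, dtype=np.float32)
one_hot[42] = 1.0                              # hypothetical word index
second_feature_vector = one_hot @ W            # equivalent to selecting W[42]
assert np.allclose(second_feature_vector, W[42])
```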
Further, as shown in Fig. 7, S204 comprises the steps of:
S701: inputting the second feature vectors and the classification labels into a denoising autoencoder, and randomly damaging the second feature vectors to obtain third feature vectors;
S702: training the denoising autoencoder based on the third feature vectors.
In the embodiments of the present invention, the second feature vectors (word vectors) and the classification labels are input into the denoising autoencoder for processing, so that the word vectors are polluted or randomly damaged; the polluted or randomly damaged word vectors serve as the third feature vectors, and the denoising autoencoder is then trained on the third feature vectors.
In this way, by training the target encoding network with word vectors that have been damaged or polluted, the obtained text vectors are more stable, and the robustness of the denoising autoencoder and the encoding network is increased, thereby improving the robustness of the entire target encoding network.
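A hedged sketch of the random-damage and reconstruction steps follows (the layer sizes, damage probability and PyTorch modules are assumptions for illustration; per the text above, the hidden layer carries no activation function):

```python
# Random corruption of the second feature vector, then training the denoising
# autoencoder to reconstruct the clean input.
import torch
import torch.nn as nn

def corrupt(x, drop_prob=0.3):
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask                           # randomly damaged third feature vector

encoder = nn.Linear(100, 32)                  # hidden layer, no activation function
decoder = nn.Linear(32, 100)

x = torch.randn(100)                          # a second feature vector (word vector)
h = encoder(corrupt(x))                       # hidden-layer vector
loss = torch.sum((decoder(h) - x) ** 2)       # reconstruct the uncorrupted input
loss.backward()
```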
Further, as shown in Fig. 8, step S204 further includes:
S801: calculating, by means of the classification labels, the inner product between the text vectors of each pair of texts;
S802: comparing the inner-product results of the texts to obtain the similarity of each text;
S803: forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
In the embodiments of the present invention, the classification labels can give the sentence vectors text-type attributes, and the inner product between texts can be used to indicate the similarity between texts. For example, there are three texts x_0, x_1 and x_2, where x_0 and x_1 are of the same type and x_0 and x_2 are of different types; their corresponding feature vectors (text vectors) at the hidden layer of the third encoder are h_0, h_1 and h_2 respectively, and the weights are adjusted by training so that the similarity of h_0 and h_1 is greater than that of h_0 and h_2:
Sim(h_0, h_1) > Sim(h_0, h_2)
That is, with x_0 as the target text, x_1 is a text of the same type as x_0 and x_2 is a text of a different type from x_0; in this way, at least one same-type and one different-type text is found for each text. For example: x_0 is in jpg format, x_1 is in jpg format, and x_2 is in docx format.
In this way, the sentence vectors are given text-type attributes according to the classification labels, classification is then completed by calculating the similarity between each text and the target text and finding the texts of the same type and of different types as the target text, and the characterization ability of the text is enhanced.
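A minimal sketch of the inner-product similarity comparison above (the vectors are illustrative; training would adjust the weights until the condition holds):

```python
# Inner product as the similarity measure, checking the same-type condition.
import torch

def sim(a, b):
    return torch.dot(a, b)                    # Sim(.) as an inner product

h0, h1, h2 = torch.randn(32), torch.randn(32), torch.randn(32)
same_type_closer = sim(h0, h1) > sim(h0, h2)  # the condition the training enforces
```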
In the present invention, a target text is formed by removing punctuation marks from the text, converting uppercase to lowercase and converting full-width to half-width; the target text then undergoes word segmentation and stop-word deletion to form a feature text; based on the first encoder, the feature text is encoded into a multidimensional one-hot vector space and output as first feature vectors; based on the neural network of the second encoder, the first feature vectors are input into the weight matrix of the hidden layer for dimensionality reduction and the corresponding second feature vectors (word vectors) are output; the second feature vectors are then input together with the classification labels into the denoising autoencoder for training, the second feature vectors being allowed to be randomly damaged or polluted; the inner products between texts are calculated and compared, and the model is trained until same-type text similarity is greater than different-type text similarity, so that the resulting texts have text-type attributes; the obtained text to be processed is then input, after text processing and word segmentation, into the trained denoising autoencoder for encoding, so that the corresponding sentence vectors have text-type attributes, the text vectors composed of the corresponding word vectors are more stable, and the characterization ability of the text vectors composed of word vectors is enhanced.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, which can be stored in a computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
It should be understood that although the steps in the flow charts of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the flow charts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but can be executed at different times; their execution order is not necessarily sequential, and they can be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
As shown in Fig. 9, which is a schematic structural diagram of the text vector acquisition apparatus provided by this embodiment, the apparatus 900 includes: a processing module 901, a first encoding module 902, a second encoding module 903, a training module 904 and an input module 905. Among them:
the processing module 901 is configured to perform text processing on at least two different types of texts to obtain target texts, and to perform word segmentation on the target texts to obtain corresponding feature texts, wherein a text includes a classification label and text content;
the first encoding module 902 is configured to encode the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain the first feature vectors of the feature texts;
the second encoding module 903 is configured to encode the first feature vectors into a word vector space through a preset second encoder to obtain the second feature vectors of the first feature vectors;
the training module 904 is configured to input the second feature vectors and the classification labels into a third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
the input module 905 is configured to obtain a text to be processed, and input the text to be processed into the target encoding network after text processing and word segmentation, to obtain the text vector of the text to be processed.
Further, as shown in Fig. 10, which is a schematic structural diagram of a specific embodiment of the processing module 901, the processing module includes a first processing submodule 9011, a second processing submodule 9012 and a third processing submodule 9013. Among them:
the first processing submodule 9011 is configured to remove punctuation marks from the text to obtain a first text;
the second processing submodule 9012 is configured to convert uppercase letters in the first text to lowercase to obtain a second text;
the third processing submodule 9013 is configured to convert full-width characters in the second text to half-width to obtain the target text.
Further, as shown in Fig. 11, which is a schematic structural diagram of another specific embodiment of the processing module 901, the processing module further includes a segmentation submodule 9014 and a first generation submodule 9015. Among them:
the segmentation submodule 9014 is configured to perform word segmentation on the target text through a tokenizer to obtain a segmentation result; and
the first generation submodule 9015 is configured to form the segmentation result into a feature text.
Further, as shown in Fig. 12, which is a schematic structural diagram of yet another specific embodiment of the processing module 901, the processing module further includes a detection submodule 9016 and a deletion submodule 9017. Among them:
the detection submodule 9016 is configured to detect, through a preset stop-word lexicon, whether stop words exist in the segmentation result;
the deletion submodule 9017 is configured to delete the stop words if they exist.
Further, the second encoding module 903 is also configured to reduce the dimensionality of the first feature vectors through a weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vectors at the hidden layer.
Further, as shown in Fig. 13, which is a schematic structural diagram of a specific embodiment of the training module 904, the training module includes an input submodule 9041 and a training submodule 9042. Among them:
the input submodule 9041 is configured to input the second feature vectors and the classification labels into a denoising autoencoder, and to randomly damage the second feature vectors to obtain third feature vectors;
the training submodule 9042 is configured to train the denoising autoencoder based on the third feature vectors.
Further, as shown in Fig. 14, which is a schematic structural diagram of another specific embodiment of the training module 904, the training module includes a calculation submodule 9043, a comparison submodule 9044 and a second generation submodule 9045. Among them:
the calculation submodule 9043 is configured to calculate, by means of the classification labels, the inner product between the text vectors of each pair of texts;
the comparison submodule 9044 is configured to compare the inner-product results of the texts to obtain the similarity of each text;
the second generation submodule 9045 is configured to form the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
The text vector acquisition apparatus provided by the embodiments of the present application can realize the embodiments of the methods of Fig. 2 to Fig. 8 and their corresponding beneficial effects; to avoid repetition, they are not described here again.
In order to solve the above technical problem, an embodiment of the present application also provides a computer device. Referring specifically to Figure 15, Figure 15 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 15 includes a memory 151, a processor 152 and a network interface 153 that communicate with each other through a system bus. It should be pointed out that only the computer device 15 with components 151-153 is shown in the figure, but it should be understood that not all of the shown components are required to be implemented; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, etc.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer device can perform human-computer interaction with the client through a keyboard, a mouse, a remote control, a touch panel or a voice-control device.
The memory 151 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 151 may be an internal storage unit of the computer device 15, such as the hard disk or memory of the computer device 15. In other embodiments, the memory 151 may also be an external storage device of the computer device 15, such as a plug-in hard disk, smart media card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card or flash card (Flash Card) equipped on the computer device 15. Of course, the memory 151 may also include both the internal storage unit of the computer device 15 and its external storage device. In this embodiment, the memory 151 is generally used to store the operating system and various kinds of application software installed on the computer device 15, such as the program code of the text vector acquisition method. In addition, the memory 151 can also be used to temporarily store various data that have been output or are to be output.
The processor 152 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 152 is generally used to control the overall operation of the computer device 15. In this embodiment, the processor 152 is used to run the program code or process the data stored in the memory 151, for example to run the program code of the text vector acquisition method.
The network interface 153 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 15 and other electronic devices.
The present application also provides another embodiment, namely a computer-readable storage medium storing a text vector acquisition program, which can be executed by at least one processor so that the at least one processor executes the steps of the text vector acquisition method described above.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk or optical disk), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner or a network device, etc.) to execute the text vector acquisition method of each embodiment of the present application.
The terms "comprising" and "having", and any variations thereof, in the description and claims of the present application and in the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the description and claims of the present application or in the above drawings are used to distinguish different objects, not to describe a particular order. Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the description does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A text vector acquisition method, characterized by comprising the steps of:
performing text processing on at least two different types of texts to obtain target texts, and performing word segmentation on the target texts to obtain corresponding feature texts, wherein a text includes a classification label and text content;
encoding the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain first feature vectors of the feature texts;
encoding the first feature vectors into a word vector space through a preset second encoder to obtain second feature vectors of the first feature vectors;
inputting the second feature vectors and the classification labels into a third encoder, training the third encoder, and iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
obtaining a text to be processed, and inputting the text to be processed into the target encoding network after the text processing and word segmentation, to obtain the text vector of the text to be processed.
2. The text vector acquisition method according to claim 1, characterized in that the step of performing text processing on the at least two different types of texts to obtain the target texts includes:
removing punctuation marks from a text to obtain a first text;
converting uppercase letters in the first text to lowercase to obtain a second text;
converting full-width characters in the second text to half-width to obtain a target text.
3. The text vector acquisition method according to claim 1, characterized in that the step of performing word segmentation on a target text to obtain a corresponding feature text includes:
performing word segmentation on the target text through a tokenizer to obtain a segmentation result; and
forming the segmentation result into a feature text.
4. The text vector acquisition method according to claim 3, characterized in that after obtaining the segmentation result, the method comprises the steps of:
detecting, through a preset stop-word lexicon, whether stop words exist in the segmentation result;
if so, deleting the stop words.
5. The text vector acquisition method according to claim 1, characterized in that the step of encoding the first feature vectors into a word vector space through the preset second encoder to obtain the second feature vectors of the first feature vectors includes:
reducing the dimensionality of the first feature vectors through a weight matrix preset between the input layer and the hidden layer of the second encoder, to obtain the second feature vectors at the hidden layer.
6. The text vector acquisition method according to claim 1, characterized in that the step of inputting the second feature vectors and the classification labels into the third encoder and training the third encoder comprises the steps of:
inputting the second feature vectors and the classification labels into a denoising autoencoder, and randomly damaging the second feature vectors to obtain third feature vectors;
training the denoising autoencoder based on the third feature vectors.
7. The text vector acquisition method according to claim 1, characterized in that the step of iterating the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining the target encoding network, includes:
calculating, by means of the classification labels, the inner product between the text vectors of each pair of texts;
comparing the inner-product results of the texts to obtain the similarity of each text;
forming the target encoding network according to the similarity of each text, wherein the target encoding network includes the first encoder, the second encoder and the third encoder.
8. A text vector acquisition apparatus, characterized by comprising:
a processing module, configured to perform text processing on at least two different types of texts to obtain target texts, and to perform word segmentation on the target texts to obtain corresponding feature texts, wherein a text includes a classification label and text content;
a first encoding module, configured to encode the feature texts into a multidimensional one-hot vector space through a preset first encoder to obtain first feature vectors of the feature texts;
a second encoding module, configured to encode the first feature vectors into a word vector space through a preset second encoder to obtain second feature vectors of the first feature vectors;
a training module, configured to input the second feature vectors and the classification labels into a third encoder, train the third encoder, and iterate the loss function of the third encoder so that the hidden-layer vectors in the third encoder satisfy the condition that same-type text similarity is greater than different-type text similarity, thereby obtaining a target encoding network;
an input module, configured to obtain a text to be processed, and input the text to be processed into the target encoding network after the text processing and word segmentation, to obtain the text vector of the text to be processed.
9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the text vector acquisition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the text vector acquisition method according to any one of claims 1 to 7 are implemented.
CN201910637101.2A 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium Active CN110532381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910637101.2A CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910637101.2A CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110532381A true CN110532381A (en) 2019-12-03
CN110532381B CN110532381B (en) 2023-09-26

Family

ID=68660195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910637101.2A Active CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110532381B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121801A1 (en) * 2016-10-28 2018-05-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for classifying questions based on artificial intelligence
CN109408702A (en) * 2018-08-29 2019-03-01 昆明理工大学 A kind of mixed recommendation method based on sparse edge noise reduction autocoding
CN109582786A (en) * 2018-10-31 2019-04-05 中国科学院深圳先进技术研究院 A kind of text representation learning method, system and electronic equipment based on autocoding
CN109885826A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Text term vector acquisition methods, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG SUZHI et al.: "Research on Feature Extraction with Clustering-Oriented Stacked Denoising Autoencoders", Modern Computer (现代计算机) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079442A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
CN111079442B (en) * 2019-12-20 2021-05-18 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
US11403468B2 (en) 2019-12-20 2022-08-02 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating vector representation of text, and related computer device
WO2021134416A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Text transformation method and apparatus, computer device, and computer readable storage medium
WO2021143020A1 (en) * 2020-01-14 2021-07-22 平安科技(深圳)有限公司 Bad term recognition method and device, electronic device, and storage medium
CN111445545A (en) * 2020-02-27 2020-07-24 北京大米未来科技有限公司 Text-to-map method, device, storage medium and electronic equipment
CN111445545B (en) * 2020-02-27 2023-08-18 北京大米未来科技有限公司 Text transfer mapping method and device, storage medium and electronic equipment
CN110990837B (en) * 2020-02-29 2023-03-24 网御安全技术(深圳)有限公司 System call behavior sequence dimension reduction method, system, equipment and storage medium
CN110990837A (en) * 2020-02-29 2020-04-10 网御安全技术(深圳)有限公司 System call behavior sequence dimension reduction method, system, equipment and storage medium
CN112528681A (en) * 2020-12-18 2021-03-19 北京百度网讯科技有限公司 Cross-language retrieval and model training method, device, equipment and storage medium
CN112749530A (en) * 2021-01-11 2021-05-04 北京光速斑马数据科技有限公司 Text encoding method, device, equipment and computer readable storage medium
CN112749530B (en) * 2021-01-11 2023-12-19 北京光速斑马数据科技有限公司 Text encoding method, apparatus, device and computer readable storage medium
CN115047894A (en) * 2022-04-14 2022-09-13 中国民用航空总局第二研究所 Unmanned aerial vehicle track measuring and calculating method, electronic equipment and storage medium
CN115047894B (en) * 2022-04-14 2023-09-15 中国民用航空总局第二研究所 Unmanned aerial vehicle track measuring and calculating method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110532381B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN110532381A Text vector acquisition method, apparatus, computer device and storage medium
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110909548A (en) Chinese named entity recognition method and device and computer readable storage medium
CN104834747A Short text classification method based on convolutional neural network
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN105843796A (en) Microblog emotional tendency analysis method and device
WO2020056977A1 (en) Knowledge point pushing method and device, and computer readable storage medium
CN110738049B (en) Similar text processing method and device and computer readable storage medium
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
WO2021051934A1 (en) Method and apparatus for extracting key contract term on basis of artificial intelligence, and storage medium
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN112084342A (en) Test question generation method and device, computer equipment and storage medium
CN111488732A (en) Deformed keyword detection method, system and related equipment
CN112329463A (en) Training method of remote monitoring relation extraction model and related device
CN110222144B (en) Text content extraction method and device, electronic equipment and storage medium
CN110019674A Text plagiarism detection method and system
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
Kokane et al. Word sense disambiguation: a supervised semantic similarity based complex network approach
CN110363206A (en) Cluster, data processing and the data identification method of data object
CN110321565B (en) Real-time text emotion analysis method, device and equipment based on deep learning
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
Nguyen et al. A feature-word-topic model for image annotation
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
JP7236501B2 (en) Transfer learning method and computer device for deep learning model based on document similarity learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant