CN110532381B - Text vector acquisition method and device, computer equipment and storage medium - Google Patents

Text vector acquisition method and device, computer equipment and storage medium

Info

Publication number
CN110532381B
CN110532381B (application CN201910637101.2A)
Authority
CN
China
Prior art keywords
text
vector
encoder
feature vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910637101.2A
Other languages
Chinese (zh)
Other versions
CN110532381A (en)
Inventor
唐亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN201910637101.2A priority Critical patent/CN110532381B/en
Publication of CN110532381A publication Critical patent/CN110532381A/en
Application granted granted Critical
Publication of CN110532381B publication Critical patent/CN110532381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application is applicable to the field of artificial intelligence and provides a text vector acquisition method, a device, computer equipment and a storage medium, wherein the method comprises the following steps: performing text processing on the text to obtain a target text, and performing text word segmentation on the target text to obtain a corresponding feature text; encoding the feature text into a multidimensional one-hot vector space through a preset first encoder to obtain a first feature vector; encoding the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector; inputting the second feature vector and the classification label into a third encoder and iterating its loss function until the hidden layer vectors satisfy that the similarity of same-type texts is greater than that of different-type texts, thereby obtaining a target coding network; and processing the text to be processed and inputting it into the target coding network to obtain a text vector of the text to be processed. The application can enhance the characterization capability of text vectors.

Description

Text vector acquisition method and device, computer equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a text vector acquisition method, a text vector acquisition device, computer equipment and a storage medium.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. With the rapid development of natural language processing technology, basic research in the field, including research on how to generate text vectors, is receiving more and more attention. In tasks such as text classification, text clustering and similarity calculation, the text needs to be vectorized in advance; mathematical operations and statistics are then performed on the vectorized text instead of the original text, so that natural language is processed into data a computer can recognize and people can communicate with computers in natural language. In the existing Sentence2Vec (sentence vector model) approach, the text contents of all types of texts are processed and put together as a training corpus, word vectors are output by word2vec (word vector model), and the text feature is obtained by summing and averaging the word vectors, with the averaged result used directly as the text vector. The existing text vector acquisition technology therefore has the problem of low text characterization capability.
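As a hedged illustration of the Sentence2Vec-style baseline described above (a sketch, not code from the patent), a sentence vector can be formed as the unweighted average of word2vec word vectors; the toy corpus and parameters below are assumptions for demonstration only.

```python
# Baseline sketch: the sentence vector is the average of the word vectors
# that word2vec outputs for the sentence's tokens.
import numpy as np
from gensim.models import Word2Vec

corpus = [["sports", "news", "match"], ["stock", "market", "news"]]  # toy corpus
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=10)

def sentence_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(sentence_vector(["sports", "news"], model).shape)  # (50,)
```

Averaging discards word order and class information, which is exactly the weakness the method below addresses.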
Disclosure of Invention
The embodiment of the application provides a text vector acquisition method, which aims to solve the problem that the representation capability of a text is low in the existing text vector acquisition technology.
The embodiment of the application is realized in such a way that a text vector acquisition method is provided, which comprises the following steps:
performing text word segmentation on at least two different types of target texts subjected to text processing to obtain corresponding characteristic texts, wherein the texts comprise classification labels and text contents;
encoding the characteristic text into a multidimensional one-hot vector space through a preset first encoder to obtain a first characteristic vector of the characteristic text;
encoding the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector;
inputting the second feature vector and the classification label into a third encoder, training the third encoder, iterating a loss function of the third encoder, and enabling hidden layer vectors in the third encoder to meet that the similarity of the same type of text is larger than that of different types of text, so as to obtain a target coding network;
and acquiring a text to be processed, performing text processing and text word segmentation on the text to be processed, and inputting the text to be processed into the target coding network to obtain a text vector of the text to be processed.
Further, the step of performing text processing on at least two different types of texts to obtain target texts includes:
performing punctuation mark removal processing on the text to obtain a first text;
performing uppercase-to-lowercase conversion on the first text to obtain a second text;
and performing full-width to half-width conversion on the second text to obtain a target text.
Further, the step of performing text segmentation on the target text to obtain the corresponding feature text includes:
performing word segmentation processing on the target text through a word segmentation device to obtain a word segmentation result; and
forming the word segmentation result into the characteristic text.
Further, after the word segmentation result is obtained, the method comprises the following steps:
detecting whether a stop word exists in the word segmentation result through a preset stop word library;
and if so, deleting the stop word.
Further, the step of encoding the first feature vector into a word vector space by a preset second encoder to obtain a second feature vector of the first feature vector includes:
and reducing the dimension of the first feature vector through a preset weight matrix from the input layer to the hidden layer in the second encoder to obtain a second feature vector of the hidden layer.
Still further, the inputting the second feature vector and the classification tag into a third encoder, and training the third encoder includes the steps of:
inputting the second feature vector and the classification label into a noise reduction automatic encoder (denoising autoencoder), and randomly damaging the second feature vector to obtain a third feature vector;
training the noise reduction automatic encoder based on the third feature vector.
Further, the step of iterating the loss function of the third encoder to make the hidden layer vector in the third encoder satisfy that the similarity of the same type of text is greater than the similarity of different types of text, and the step of obtaining the target coding network includes:
calculating an inner product among text vectors of each text through the classification labels;
comparing the inner product results of the texts to obtain the similarity of the texts;
and forming the target coding network according to the similarity of the texts, wherein the target coding network comprises the first encoder, the second encoder and the third encoder.
The application also provides a text vector acquisition device, which comprises:
the processing module is used for carrying out text processing on at least two different types of texts to obtain a target text, and carrying out text word segmentation on the target text to obtain a corresponding characteristic text, wherein the text comprises a classification label and text content;
the first coding module is used for coding the characteristic text into a multidimensional one-hot vector space through a preset first coder to obtain a first characteristic vector of the characteristic text;
the second coding module is used for coding the first feature vector into a word vector space through a preset second coder to obtain a second feature vector of the first feature vector;
the training module is used for inputting the second feature vector and the classification label into a third encoder, training the third encoder, iterating a loss function of the third encoder, and enabling hidden layer vectors in the third encoder to meet that the similarity of texts of the same type is larger than that of texts of different types, so that a target coding network is obtained;
the input module is used for acquiring a text to be processed, carrying out text processing and text word segmentation on the text to be processed, and inputting the text to be processed into the target coding network to obtain a text vector of the text to be processed.
The application also provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the text vector acquisition method described above when executing the computer program.
The application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text vector acquisition method described above.
The beneficial effects realized by the application are as follows: the target text is subjected to text word segmentation; the feature text is encoded by the first encoder and the second encoder to obtain the first feature vector and its corresponding second feature vector (word vector); and the second feature vector and the classification label are input into the third encoder for training, during which the second feature vector is damaged or polluted and the model is trained until the similarity of same-type texts is greater than that of different-type texts. The obtained text vector is therefore more stable, and the characterization capability of the text vector formed from word vectors is enhanced.
Drawings
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a text vector retrieval method provided by an embodiment of the present application;
FIG. 3 is a flow chart of one embodiment of S201 in FIG. 2;
FIG. 4 is a flow chart of another embodiment of S201 in FIG. 2;
FIG. 5 is a flow chart of one embodiment of S401 in FIG. 4;
FIG. 6 is a flow chart of one embodiment of S203 in FIG. 2;
FIG. 7 is a flow chart of one embodiment of S204 in FIG. 2;
FIG. 8 is a flow chart of another embodiment of S204 in FIG. 2;
fig. 9 is a schematic structural diagram of a text vector obtaining device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of one embodiment of the processing module shown in FIG. 9;
FIG. 11 is a schematic diagram of another embodiment of the process module shown in FIG. 9;
FIG. 12 is a schematic diagram of another embodiment of the process module shown in FIG. 9;
FIG. 13 is a schematic diagram of one embodiment of the training module shown in FIG. 9;
FIG. 14 is a schematic diagram of one embodiment of the training module shown in FIG. 9;
FIG. 15 is a schematic diagram of the architecture of one embodiment of a computer device of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
According to the application, the target text is subjected to text word segmentation; the feature text is encoded by the first encoder and the second encoder to obtain the first feature vector and its corresponding second feature vector (word vector); and the second feature vector and the classification label are input into the third encoder for training, during which the second feature vector is damaged or polluted and the model is trained until the similarity of same-type texts is greater than that of different-type texts. The obtained text vector is therefore more stable, and the characterization capability of the text vector formed from word vectors is enhanced.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the server 105 and the terminal devices 101, 102, 103, and may include various connection types, such as wired or wireless communication links or fiber optic cables. The terminal devices 101, 102, 103 may be various electronic devices with display screens that can download application software and display text, including but not limited to smartphones, tablet computers, laptop computers and desktop computers. The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103. A client can interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or obtain information and the like.
It should be noted that, the text vector obtaining method provided by the embodiment of the present application may be executed by a server/terminal device, and accordingly, a text vector obtaining apparatus may be provided in the server/terminal device.
It should be understood that the numbers of terminal devices, networks and servers in fig. 1 are merely illustrative, and any number of terminal devices, networks and servers may be provided as required by the implementation.
As shown in fig. 2, a flowchart of one embodiment of a text vector retrieving method according to the present application is provided. The text vector obtaining method comprises the following steps:
s201, performing text word segmentation on at least two different types of target texts subjected to text processing to obtain corresponding characteristic texts, wherein the texts comprise classification labels and text contents.
In this embodiment, the method runs on an electronic device (e.g., the server or terminal device shown in fig. 1). The text may be text under an information classification, for example: text under columns such as sports news and entertainment news in a news classification; it may also be text under a content classification, for example: text in different boards of a forum; of course, it may also be text under an insurance classification, or text in any other form expressed in natural language, which is not limited in the embodiment of the application. The classification labels may be used to describe the type (also referred to as category) of a text, and the text content may refer to the textual information expressed in natural language.
Specifically, the above text processing can be understood as processing the text content of a text and converting it into natural-language words convenient for computer processing; to increase the processing speed, the format and characters of the content are processed accordingly, for example: converting the text format to TXT and removing irrelevant words. Text word segmentation is performed on the target text with a word segmentation tool; segmenting one text yields a plurality of phrases, and the corresponding feature text comprises these phrases.
S202, encoding the feature text into a multidimensional one-hot vector space through a preset first encoder to obtain a first feature vector of the feature text.
In this embodiment, the encoding rule of the first encoder may be user-defined, or a one-hot encoder using a publicly available encoding rule may be used; it encodes each word in the feature text into a one-hot vector, thereby obtaining the first feature vector. Encoding the feature text into the multidimensional one-hot vector space allows the feature text to be encoded rapidly.
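A minimal sketch of such a one-hot first encoder follows; the vocabulary construction and token list are illustrative assumptions rather than the patent's exact encoding rule.

```python
# One-hot "first encoder" sketch: each token of the feature text is
# mapped to a one-hot row vector over the vocabulary.
import numpy as np

def one_hot_encode(tokens):
    vocab = sorted(set(tokens))                  # vocabulary built from the feature text
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        vectors[row, index[tok]] = 1.0
    return vectors, vocab

vectors, vocab = one_hot_encode(["palace", "famous", "spots", "palace"])
print(vectors.shape)  # (4, 3): four tokens over a three-word vocabulary
```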
S203, encoding the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector.
In the embodiment of the application, deep learning is used to convert a text into a vector in a multidimensional vector space for calculation. The second encoder may be word2vec: the first feature vector of the obtained feature text is taken as the input vector of word2vec and mapped into the word vector space, seeking a deeper feature representation of the text data. The input first feature vector may be predicted using the continuous bag-of-words model in word2vec to obtain the second feature vector (word vector).
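Since the text above names word2vec's continuous bag-of-words (CBOW) model as one possible second encoder, a brief gensim sketch follows; the corpus and hyperparameters are illustrative assumptions.

```python
# CBOW word2vec as a possible "second encoder": sg=0 selects the
# continuous bag-of-words model, and the trained projection weights
# yield the word vectors (second feature vectors).
from gensim.models import Word2Vec

sentences = [["palace", "famous", "spots"], ["glazed", "tiles", "palace"]]
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1, epochs=20)
word_vec = cbow.wv["palace"]   # second feature vector for one token
print(word_vec.shape)          # (100,)
```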
S204, inputting the second feature vector and the classification label into a third encoder, training the third encoder, and iterating a loss function of the third encoder to enable hidden layer vectors in the third encoder to meet that the similarity of the texts with the same type is larger than that of the texts with different types, so as to obtain a target coding network.
In the embodiment of the present application, the hidden layer may have no activation function (Activation Function), that is, only the feature vector of the hidden layer is needed. The loss function of the third encoder may combine a basic reconstruction term L_R with a similarity term L_T weighted by α, where L_R(·) is a square error function serving as the basic loss function, L_R(y_n, x_n) = ‖y_n − x_n‖², L_T(h_0, h_1, h_2) = Sim(h_0, h_1) − Sim(h_0, h_2), α is a real number between 0 and 1, and Sim(·) is an inner product function. Training iterates with the classification labels until the hidden layer vectors in the third encoder satisfy that the similarity of same-type texts is greater than that of different-type texts, for example: X_0 and X_1 are the same text type with a similarity of 80%, while X_2 and X_0 are different text types with a similarity of 1%. Of course, similarity may also refer to distance, for example: Beijing and Tianjin are the same text type, while Beijing and Xinjiang are different text types. In this way, the characterization capability of the text vector may be enhanced.
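A hedged PyTorch sketch of such a combined loss follows. The text above defines L_R and L_T but not exactly how they are combined, so the sign convention here (rewarding higher same-type similarity) is an assumption.

```python
# Sketch of a third-encoder loss: squared reconstruction error plus an
# inner-product similarity term over hidden vectors h0, h1, h2, where
# h0/h1 are same-type and h0/h2 are different-type texts. How L_T enters
# the total loss is an assumption; the text only defines L_R and L_T.
import torch

def third_encoder_loss(y, x, h0, h1, h2, alpha=0.5):
    l_r = torch.sum((y - x) ** 2)                # L_R(y_n, x_n) = ||y_n - x_n||^2
    l_t = torch.dot(h0, h1) - torch.dot(h0, h2)  # L_T = Sim(h0, h1) - Sim(h0, h2)
    return l_r - alpha * l_t                     # higher same-type similarity lowers the loss

y, x = torch.randn(8), torch.randn(8)
h0, h1, h2 = torch.randn(16), torch.randn(16), torch.randn(16)
print(third_encoder_loss(y, x, h0, h1, h2).item())
```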
S205, acquiring a text to be processed, performing text processing and text word segmentation on the text to be processed, and inputting the text to be processed into a target coding network to obtain a text vector of the text to be processed.
In the embodiment of the application, the text to be processed may be any text whose features need to be extracted from its text information, such as a newly added text that a user has just uploaded or that has just been crawled. Since the hidden layer of the third encoder has no activation function, the text vector of the text to be processed can be obtained directly from the hidden layer of the third encoder in the target coding network; that is, the text to be processed, after text processing and text word segmentation, is input into the trained third encoder capable of classification, so as to obtain a text vector with a classification attribute.
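A minimal sketch of reading out this hidden-layer text vector at inference time follows; the layer sizes and architecture are illustrative assumptions.

```python
# Inference sketch: the trained third encoder's hidden layer (which has
# no activation function) is read out directly as the text vector.
import torch
import torch.nn as nn

class ThirdEncoder(nn.Module):
    def __init__(self, in_dim=100, hid_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, hid_dim)  # hidden layer, no activation
        self.dec = nn.Linear(hid_dim, in_dim)  # used only during training

    def text_vector(self, x):
        return self.enc(x)                     # hidden-layer vector = text vector

model = ThirdEncoder()
second_feature_vector = torch.randn(1, 100)    # output of the second encoder
print(model.text_vector(second_feature_vector).shape)  # torch.Size([1, 32])
```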
According to the method, the target text is subjected to text word segmentation; the feature text is encoded by the neural networks of the first encoder and the second encoder to obtain the first feature vector and its corresponding second feature vector (word vector); and the second feature vector and the classification label are input into the third encoder for training, during which the second feature vector is damaged or polluted and the model is trained until the similarity of same-type texts is greater than that of different-type texts. The acquired text to be processed is then input, after text processing and text word segmentation, into the trained third encoder for encoding. The obtained text vector is therefore more stable, and the characterization capability of the text vector formed from word vectors is enhanced.
Further, as shown in fig. 3, the step of S201 includes:
s301, performing punctuation removal processing on the text to obtain a first text;
s302, performing capitalization and lowercase processing on the first text to obtain a second text;
s303, performing full-angle and half-angle conversion on the second text to obtain a target text.
Character strings can be processed through regular-expression matching: specific characters describe the rules by which characters occur in a string, so that strings conforming to a given rule can be matched, extracted or replaced, and strings can be searched, deleted and replaced quickly and accurately.
Specifically, a symbol expression is used to match the text; wherever it matches, punctuation marks exist in the text, and they are deleted to obtain the first text. The symbol expression may be a regular expression dedicated to matching punctuation marks in the text, for example one built on the Unicode punctuation class \p{P}. For example, suppose the text is: [Health] If these "manifestations" appear in the body at night when sleeping, the body may suffer from a disease! Matching the text with the symbol expression yields the symbols "[", "]", the quotation marks, the comma and the exclamation mark; after these symbols are deleted, the processed first text is obtained as: Health if these manifestations appear in the body at night when sleeping the body may suffer from a disease
More specifically, each character in the obtained first text is traversed and matched with a letter conversion expression; if a character is an uppercase letter, it is converted into a lowercase letter, and when all characters have been matched, the converted first text is taken as the second text. The letter conversion expression may refer to a regular expression dedicated to matching the uppercase letters in the first text and converting them into lowercase letters; a specific regular expression may be $reg = '/(\w+)/e'. Then the second text may be imported into a preset conversion library for full-width to half-width conversion, so as to obtain the converted target text, where the preset conversion library may be a database that identifies full-width characters in the second text and converts them into half-width characters; it may specifically process the text using regular matching or a preset script.
In this way, punctuation deletion, case conversion and full-width to half-width conversion are performed on the text to obtain a target text, which can increase the processing speed of a computer.
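A hedged Python sketch of these three preprocessing steps follows; the punctuation pattern and the width-conversion logic are illustrative assumptions, not the patent's exact expressions.

```python
# Sketch of the three text-processing steps: punctuation removal via a
# Unicode-punctuation pattern, lowercasing, and full-width to half-width
# conversion.
import regex  # third-party module supporting the \p{P} Unicode class

def to_half_width(s):
    out = []
    for ch in s:
        code = ord(ch)
        if code == 0x3000:                  # full-width space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:      # full-width ASCII variants
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def preprocess(text):
    first = regex.sub(r"\p{P}+", "", text)  # remove punctuation -> first text
    second = first.lower()                  # case conversion    -> second text
    return to_half_width(second)            # width conversion   -> target text

print(preprocess("Ｈｅｌｌｏ, World!"))  # -> "hello world"
```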
Further, as shown in fig. 4, the step of S201 further includes:
s401, performing word segmentation on the target text through a word segmentation device to obtain a word segmentation result; and
s402, forming the word segmentation result into characteristic text.
In the embodiment of the application, the target text is imported into the jieba word segmentation tool and a word segmentation mode is selected. The modes may include the full mode, the precise mode, new-word recognition, the search-engine mode and the like, where new-word recognition allows custom new words to be added; the precise mode is preferred for word segmentation in this embodiment. For example, if the target text is: 'The famous scenic spots of the Palace Museum include the Qianqing Palace, the Taihe Palace, yellow glazed tiles and so on', precise-mode word segmentation can yield the result: 'Palace Museum / famous / scenic spots / include / Qianqing Palace / Taihe Palace / and / yellow / glazed tiles / etc.'. The obtained word segmentation result can be used as the feature text.
Therefore, the word segmentation tool segments the target text into its corresponding feature text, which facilitates subsequent encoding by the encoders.
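A brief jieba sketch of precise-mode segmentation follows, mirroring the Palace Museum example above.

```python
# Precise-mode segmentation with jieba; cut_all=False selects the
# precise mode described above.
import jieba

target_text = "故宫著名景点包括乾清宫、太和殿、黄琉璃瓦等"
tokens = jieba.lcut(target_text, cut_all=False)
print("/".join(tokens))
```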
Further, as shown in fig. 5, after S401, the steps include:
s501, detecting whether a stop word exists in a word segmentation result through a preset stop word library;
s502, deleting the stop word if the stop word exists.
In the embodiment of the application, all stop words can be obtained from the preset stop word library, and each word in the word segmentation result is then compared with the stop words. When the word segmentation result contains at least one of the stop words, the words identical to stop words are deleted, and the result after deletion is used as the word segmentation result representing the text features; when no stop word is detected in the word segmentation result, it can be used directly as the word segmentation result representing the text features. The preset stop word library refers to a database used to store stop words.
Therefore, by comparing the word segmentation result against the stop words in the stop word library, the stop words appearing in it are deleted, yielding a word segmentation result that better represents the text features.
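A minimal sketch of this filtering step follows; the stop-word set is a toy stand-in for the preset stop word library.

```python
# Stop-word filtering against a preset stop-word library (toy stand-in).
stop_words = {"的", "了", "and", "etc"}  # hypothetical preset library

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["Palace", "and", "tiles"], stop_words))  # ['Palace', 'tiles']
```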
Further, as shown in fig. 6, the step of S203 includes:
s601, performing dimension reduction on the first feature vector through a weight matrix preset from an input layer to an hidden layer in the second encoder to obtain a second feature vector of the hidden layer.
In the embodiment of the application, the weight matrix, preset between the input layer and the hidden layer of the second encoder, realizes the dimension reduction of the first feature vector. The second feature vector obtained by the dimension reduction can be represented in matrix form: the number of columns in the matrix is the dimension of the word vectors, and the number of rows is the number of word vectors (i.e., the number of words in the dictionary).
Thus, encoding through the weight matrix for dimension reduction mitigates the curse of dimensionality and reduces the amount of calculation.
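A short numpy sketch illustrates why this multiplication is a dimension reduction: a one-hot row vector times the weight matrix simply selects one row of the matrix, which is the low-dimensional word vector. Sizes are illustrative.

```python
# Input-to-hidden weight matrix as dimension reduction: one-hot (10000,)
# times W (10000 x 100) yields the 100-dimensional hidden-layer vector.
import numpy as np

vocab_size, embed_dim = 10000, 100
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))  # input-layer -> hidden-layer weights

one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0                             # first feature vector (one-hot)
second_feature_vector = one_hot @ W           # hidden-layer vector, shape (100,)
print(np.allclose(second_feature_vector, W[42]))  # True: selects row 42
```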
Further, as shown in fig. 7, S204 includes the steps of:
s701, inputting the second feature vector and the classification label into a noise reduction automatic encoder, and randomly damaging the second feature vector to obtain a third feature vector;
s702, training the noise reduction automatic encoder based on the third feature vector.
In the embodiment of the application, the second feature vector (word vector) and the classification label are input into the noise reduction automatic encoder for processing, so that the word vector is polluted or randomly damaged; the polluted or randomly damaged word vector is used as the third feature vector, and the noise reduction automatic encoder is then trained with the third feature vector.
Therefore, because the word vectors are damaged or polluted during training, the text vectors obtained from the trained target coding network are more stable, and the robustness of the noise reduction automatic encoder and of the coding network is increased, thereby improving the robustness of the whole target coding network.
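A hedged PyTorch sketch of such a denoising step follows; the architecture, masking-style corruption and corruption rate are assumptions for illustration.

```python
# Denoising ("noise reduction") autoencoder sketch: the second feature
# vector is randomly damaged into a third feature vector, and the network
# learns to reconstruct the clean input from it.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=100, hid=32, corrupt_p=0.3):
        super().__init__()
        self.corrupt_p = corrupt_p
        self.enc = nn.Linear(dim, hid)
        self.dec = nn.Linear(hid, dim)

    def forward(self, x):
        mask = (torch.rand_like(x) > self.corrupt_p).float()
        third = x * mask                        # randomly damaged third feature vector
        return self.dec(self.enc(third))

dae = DenoisingAutoencoder()
x = torch.randn(4, 100)                         # batch of second feature vectors
loss = nn.functional.mse_loss(dae(x), x)        # reconstruct the clean input
loss.backward()
print(loss.item())
```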
Further, as shown in fig. 8, the step of S204 further includes:
s801, calculating an inner product among text vectors of each text through classifying labels;
s802, comparing inner product results of the texts to obtain similarity of the texts;
s803, forming a target coding network according to the similarity of the texts, wherein the target coding network comprises a first coder, a second coder and a third coder.
In the embodiment of the application, the classification labels endow sentence vectors with a text-type attribute, and the inner product between texts can represent the similarity between texts. For example, suppose there are three texts x_0, x_1, x_2, where x_0 and x_1 are the same type of text and x_0 and x_2 are different types of text, and their corresponding feature vectors (text vectors) in the hidden layer of the third encoder are h_0, h_1, h_2. The weights are adjusted through training so that the similarity of h_0 and h_1 is greater than the similarity of h_0 and h_2:
Sim(h_0, h_1) > Sim(h_0, h_2)
That is, with x_0 as the target text, x_1 is a text of the same type as x_0 and x_2 is a text of a different type from x_0; in other words, at least one same-type text and one different-type text is found for each text x. For example: x_0 is in jpg format, x_1 is in jpg format, and x_2 is in docx format.
In this way, sentence vectors are endowed with a text-type attribute according to the classification labels, and texts of the same type as and of different types from the target text are found by calculating the similarity between each classified text and the target text, thereby enhancing the characterization capability of the text.
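A small numpy sketch of this inner-product check follows; the vectors are made-up stand-ins for trained hidden-layer text vectors.

```python
# Inner-product similarity over hidden-layer text vectors: training
# should leave Sim(h0, h1) > Sim(h0, h2) when x0 and x1 share a
# classification label and x2 does not.
import numpy as np

def sim(a, b):
    return float(np.dot(a, b))        # Sim() as an inner product

h0 = np.array([0.9, 0.1, 0.0])        # target text vector
h1 = np.array([0.8, 0.2, 0.1])        # same-type text vector
h2 = np.array([0.0, 0.1, 0.9])        # different-type text vector
print(sim(h0, h1) > sim(h0, h2))      # True for a well-trained network
```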
In summary, the application forms a target text by removing punctuation from the text, converting case and converting full-width characters to half-width; the target text is then segmented and stop words are deleted to form the feature text. The feature text is encoded into a multidimensional one-hot vector space by the first encoder, which outputs a first feature vector; the first feature vector is then input into the weight matrix of the hidden layer of the second encoder's neural network and, after dimension reduction, a corresponding second feature vector (word vector) is output. The second feature vector and the classification label are then input into the noise reduction automatic encoder for training, during which the second feature vector is randomly damaged or polluted, the inner products between texts are calculated and compared, and the model is trained until the similarity of same-type texts is greater than that of different-type texts, so that the sentence vector carries a text-type attribute. Finally, the acquired text to be processed, after text processing and text word segmentation, is input into the trained noise reduction automatic encoder, and the resulting text vector both carries the text-type attribute and is more stable, enhancing the characterization capability of the text vector.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
As shown in fig. 9, a schematic structural diagram of a text vector obtaining apparatus according to the present embodiment is provided, where the apparatus 900 includes: processing module 901, first encoding module 902, second encoding module 903, training module 904, input module 905. Wherein:
the processing module 901 is configured to perform text processing on at least two different types of texts to obtain a target text, and perform text word segmentation on the target text to obtain a corresponding feature text, where the text includes a classification tag and text content;
the first encoding module 902 is configured to encode the feature text into a multidimensional one-hot vector space through a preset first encoder, so as to obtain a first feature vector of the feature text;
a second encoding module 903, configured to encode the first feature vector into a word vector space through a preset second encoder, to obtain a second feature vector of the first feature vector;
the training module 904 is configured to input the second feature vector and the classification label into a third encoder, train the third encoder, iterate a loss function of the third encoder, and make a hidden layer vector in the third encoder satisfy that the similarity of the same type of text is greater than the similarity of different types of text, so as to obtain a target coding network;
the input module 905 is configured to obtain a text to be processed, perform text processing and text word segmentation on the text to be processed, and input the text to a target coding network to obtain a text vector of the text to be processed.
Further, as shown in fig. 10, a schematic structural diagram of an embodiment of the processing module 901 includes: a first processing sub-module 9011, a second processing sub-module 9012, and a third processing sub-module 9013. Wherein:
the first processing sub-module 9011 is configured to perform punctuation removal processing on the text to obtain a first text;
a second processing sub-module 9012, configured to perform uppercase-lowercase processing on the first text, to obtain a second text;
and the third processing sub-module 9013 is configured to perform full-width to half-width conversion on the second text, so as to obtain the target text.
Further, as shown in fig. 11, a schematic structural diagram of another embodiment of the processing module 901 further includes: a word segmentation submodule 9014 and a first generation submodule 9015. Wherein:
the word segmentation submodule 9014 is used for carrying out word segmentation processing on the target text through a word segmentation device to obtain a word segmentation result; and
a first generating sub-module 9015 is configured to form the word segmentation result into a feature text.
Further, as shown in fig. 12, a schematic structural diagram of another embodiment of the processing module 901 further includes: a detection submodule 9016 and a deletion submodule 9017. Wherein:
the detection submodule 9016 is used for detecting whether a stop word exists in the word segmentation result through a preset stop word bank;
the deletion sub-module 9017 is configured to delete the stop word if present.
Further, the second encoding module 903 is further configured to perform dimension reduction on the first feature vector through a preset weight matrix from the input layer to the hidden layer in the second encoder, so as to obtain a second feature vector of the hidden layer.
Further, as shown in fig. 13, a schematic structural diagram of an embodiment of the training module 904 includes: an input submodule 9041 and a training submodule 9042. Wherein:
an input submodule 9041, configured to input the second feature vector and the classification tag into a noise-reduction automatic encoder, and randomly damage the second feature vector to obtain a third feature vector;
a training submodule 9042 is configured to train the noise-reduction automatic encoder based on the third feature vector.
Further, as shown in fig. 14, a schematic structural diagram of another embodiment of the training module 904 includes: a calculation sub-module 9043, a comparison sub-module 9044, and a second generation sub-module 9045. Wherein:
a calculation sub-module 9043 for calculating an inner product between text vectors of the respective texts by classifying the tags;
a comparison sub-module 9044 for comparing the inner product results of the texts to obtain the similarity of the texts;
The second generating submodule 9045 is configured to form a target coding network according to the similarity of the texts, where the target coding network includes a first encoder, a second encoder, and a third encoder.
The text vector obtaining device provided by the embodiment of the present application can implement each implementation manner in the method embodiments of fig. 2 to 8, and corresponding beneficial effects, and in order to avoid repetition, a description is omitted here.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 15, fig. 15 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 15 includes a memory 151, a processor 152, and a network interface 153 communicatively coupled to each other via a system bus. It should be noted that only a computer device 15 having components 151-153 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculations and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices and the like.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with the client through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 151 includes at least one type of readable storage medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 151 may be an internal storage unit of the computer device 15, such as a hard disk or memory of the computer device 15. In other embodiments, the memory 151 may also be an external storage device of the computer device 15, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 15. Of course, the memory 151 may also include both internal storage units of the computer device 15 and external storage devices. In the present embodiment, the memory 151 is typically used to store an operating system and various types of application software installed on the computer device 15, such as program code of a text vector acquisition method. Further, the memory 151 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 152 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 152 is generally used to control the overall operation of the computer device 15. In this embodiment, the processor 152 is configured to execute the program code stored in the memory 151 or process data, such as program code for executing a text vector retrieving method.
Network interface 153 may include a wireless network interface or a wired network interface, and network interface 153 is typically used to establish communications connections between computer device 15 and other electronic devices.
The present application also provides another embodiment, namely, a computer-readable storage medium storing a text vector retrieving program executable by at least one processor to cause the at least one processor to perform the steps of a text vector retrieving method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, or by means of hardware, though in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the text vector acquisition method according to various embodiments of the present application.
The terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (8)

1. A text vector retrieval method, comprising the steps of:
performing text processing on at least two different types of texts to obtain a target text, and performing text word segmentation on the target text to obtain a corresponding characteristic text, wherein the text comprises a classification tag and text content;
encoding the characteristic text into a multidimensional one-hot vector space through a preset first encoder to obtain a first characteristic vector of the characteristic text;
encoding the first feature vector into a word vector space through a preset second encoder to obtain a second feature vector of the first feature vector;
inputting the second feature vector and the classification label into a third encoder, training the third encoder, iterating a loss function of the third encoder, and enabling hidden layer vectors in the third encoder to meet that the similarity of the same type of text is larger than that of different types of text, so as to obtain a target coding network;
the specific steps of inputting the second feature vector and the classification label into a third encoder and training the third encoder include:
inputting the second feature vector and the classification label into a noise reduction automatic encoder, and randomly damaging the second feature vector to obtain a third feature vector;
training the noise reduction automatic encoder based on the third feature vector;
iterating the loss function of the third encoder to enable the hidden layer vector in the third encoder to meet that the similarity of the same type of text is larger than that of different types of text, and obtaining the target coding network comprises the following specific steps:
calculating an inner product among text vectors of each text through the classification labels;
comparing the inner product results of the texts to obtain the similarity of the texts;
forming the target coding network according to the similarity of the texts, wherein the target coding network comprises the first encoder, the second encoder and the third encoder;
and acquiring a text to be processed, performing text processing and text word segmentation on the text to be processed, and inputting the text to be processed into the target coding network to obtain a text vector of the text to be processed.
2. The method for obtaining text vectors according to claim 1, wherein the step of performing text processing on at least two different types of text to obtain the target text comprises:
performing punctuation mark removal processing on the text to obtain a first text;
performing uppercase-to-lowercase conversion on the first text to obtain a second text;
and performing full-width to half-width conversion on the second text to obtain a target text.
3. The method for obtaining a text vector according to claim 1, wherein the step of performing text segmentation on the target text to obtain the corresponding feature text comprises:
performing word segmentation processing on the target text through a word segmentation device to obtain a word segmentation result; and
forming the word segmentation result into the characteristic text.
4. A method of obtaining a text vector according to claim 3, wherein after the word segmentation result is obtained, the method comprises the steps of:
detecting whether a stop word exists in the word segmentation result through a preset stop word library;
and if so, deleting the stop word.
5. The text vector retrieving method according to claim 1, wherein the step of encoding the first feature vector into a word vector space by a second encoder set in advance to obtain a second feature vector of the first feature vector includes:
and reducing the dimension of the first feature vector through a preset weight matrix from the input layer to the hidden layer in the second encoder to obtain a second feature vector of the hidden layer.
6. A text vector retrieving apparatus, comprising:
the processing module is used for carrying out text processing on at least two different types of texts to obtain a target text, and carrying out text word segmentation on the target text to obtain a corresponding characteristic text, wherein the text comprises a classification label and text content;
the first coding module is used for coding the characteristic text into a multidimensional one-hot vector space through a preset first coder to obtain a first characteristic vector of the characteristic text;
the second coding module is used for coding the first feature vector into a word vector space through a preset second coder to obtain a second feature vector of the first feature vector;
the training module is used for inputting the second feature vector and the classification label into a third encoder, training the third encoder, iterating a loss function of the third encoder, and enabling hidden layer vectors in the third encoder to meet that the similarity of texts of the same type is larger than that of texts of different types, so that a target coding network is obtained;
the input module is used for acquiring a text to be processed, carrying out the text processing and the text word segmentation on the text to be processed, and inputting the text to the target coding network to obtain a text vector of the text to be processed;
the training module comprises:
the input sub-module is used for inputting the second feature vector and the classification label into the noise reduction automatic encoder, and randomly damaging the second feature vector to obtain a third feature vector;
the training sub-module is used for training the noise reduction automatic encoder based on the third feature vector;
the training module further comprises:
the computing sub-module is used for computing the inner product among the text vectors of each text through the classification labels;
a comparison sub-module for comparing the inner product results of the texts to obtain the similarity of the texts;
And the second generation submodule is used for forming a target coding network according to the similarity of the texts, wherein the target coding network comprises a first encoder, a second encoder and a third encoder.
7. A computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of a text vector retrieving method according to any of claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of a text vector retrieving method according to any of claims 1 to 5.
CN201910637101.2A 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium Active CN110532381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910637101.2A CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910637101.2A CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110532381A CN110532381A (en) 2019-12-03
CN110532381B true CN110532381B (en) 2023-09-26

Family

ID=68660195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910637101.2A Active CN110532381B (en) 2019-07-15 2019-07-15 Text vector acquisition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110532381B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079442B (en) 2019-12-20 2021-05-18 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
WO2021134416A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Text transformation method and apparatus, computer device, and computer readable storage medium
CN111241820A (en) * 2020-01-14 2020-06-05 平安科技(深圳)有限公司 Bad phrase recognition method, device, electronic device, and storage medium
CN111445545B (en) * 2020-02-27 2023-08-18 北京大米未来科技有限公司 Text transfer mapping method and device, storage medium and electronic equipment
CN110990837B (en) * 2020-02-29 2023-03-24 网御安全技术(深圳)有限公司 System call behavior sequence dimension reduction method, system, equipment and storage medium
CN112214965A (en) * 2020-10-21 2021-01-12 科大讯飞股份有限公司 Case regulating method and device, electronic equipment and storage medium
CN112528681A (en) * 2020-12-18 2021-03-19 北京百度网讯科技有限公司 Cross-language retrieval and model training method, device, equipment and storage medium
CN112749530B (en) * 2021-01-11 2023-12-19 北京光速斑马数据科技有限公司 Text encoding method, apparatus, device and computer readable storage medium
CN115047894B (en) * 2022-04-14 2023-09-15 中国民用航空总局第二研究所 Unmanned aerial vehicle track measuring and calculating method, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702A (en) * 2018-08-29 2019-03-01 昆明理工大学 A kind of mixed recommendation method based on sparse edge noise reduction autocoding
CN109582786A (en) * 2018-10-31 2019-04-05 中国科学院深圳先进技术研究院 A kind of text representation learning method, system and electronic equipment based on autocoding
CN109885826A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Text term vector acquisition methods, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503236B (en) * 2016-10-28 2020-09-11 北京百度网讯科技有限公司 Artificial intelligence based problem classification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702A (en) * 2018-08-29 2019-03-01 昆明理工大学 A kind of mixed recommendation method based on sparse edge noise reduction autocoding
CN109582786A (en) * 2018-10-31 2019-04-05 中国科学院深圳先进技术研究院 A kind of text representation learning method, system and electronic equipment based on autocoding
CN109885826A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Text term vector acquisition methods, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"面向聚类的堆叠降噪自动编码器的特征提取研究";张素智 等;《现代计算机》;全文 *

Also Published As

Publication number Publication date
CN110532381A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN111444340B (en) Text classification method, device, equipment and storage medium
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN111475617A (en) Event body extraction method and device and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
EP4191544A1 (en) Method and apparatus for recognizing token, electronic device and storage medium
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN112507388B (en) Word2vec model training method, device and system based on privacy protection
CN114091451A (en) Text classification method, device, equipment and storage medium
CN114445833A (en) Text recognition method and device, electronic equipment and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN109933788B (en) Type determining method, device, equipment and medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112732913B (en) Method, device, equipment and storage medium for classifying unbalanced samples

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant