WO2019080864A1 - 一种文本语义编码方法及装置 - Google Patents

一种文本语义编码方法及装置

Info

Publication number
WO2019080864A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic
vector
word
text data
text
Prior art date
Application number
PCT/CN2018/111628
Other languages
English (en)
French (fr)
Inventor
王成龙
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Priority to JP2020520227A priority Critical patent/JP2021501390A/ja
Priority to US16/754,832 priority patent/US20200250379A1/en
Publication of WO2019080864A1 publication Critical patent/WO2019080864A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present invention relate to the field of computer technologies, and in particular, to a text semantic coding method and apparatus.
  • a user Q&A service is required.
  • Internet applications provide consulting services about their features to help users better understand their product features.
  • the user and the customer service mainly use natural language text to communicate.
  • many service providers choose to use text mining or information retrieval technology to provide users with automatic question-and-answer services instead of costly and poorly scalable human resources.
  • in order to mine and process the natural language text involved in question answering, it must be numerically encoded, that is, subjected to text encoding.
  • one existing method encodes variable-length text with the bag-of-words technique: each text is represented by an integer-valued vector of length V, where V is the size of the dictionary; each position of the vector corresponds to a word, and its value is the number of occurrences of that word in the text.
  • this coding method only uses the word frequency information in the text, and ignores the context dependency between words and words, so it is difficult to fully express the semantic information contained in the text.
  • bag-of-words encoding length is the size of the entire dictionary (usually on the order of hundreds of thousands), with the vast majority of the encoding values being zero.
  • the sparseness of the encoding is not conducive to subsequent text mining, and the excessive encoding length also greatly reduces the subsequent text processing speed.
  • to address these problems, a word embedding technique emerged for encoding text.
  • the method uses a fixed-length floating-point value vector to express the text semantics.
  • the word embedding encoding method is a compressed data representation. Specifically, a fixed-length (usually around 100-dimensional) floating-point value vector is used to express text semantics. Compared with the bag-of-words encoding method, the dimensionality is greatly reduced, which effectively solves the problem of data sparsity and can greatly improve the subsequent text processing speed.
  • word embedding coding usually requires pre-training, that is, it is necessary to determine which text to encode during offline training.
  • this algorithm is therefore commonly used to encode and express short texts such as words or phrases, which can be exhaustively enumerated.
  • text at the sentence and paragraph level, however, is variable-length sequence data; such variable-length sequences cannot be enumerated, so their encodings cannot be obtained through pre-training. Therefore, the text encoding methods provided by the prior art have the defect that variable-length text data cannot be accurately encoded.
  • the embodiments of the present application provide a text semantic encoding method and apparatus, which aim to solve the technical problem that the prior art cannot accurately encode variable-length text data.
  • a first aspect of the embodiments of the present application discloses a text semantic encoding method, including: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
  • a second aspect of the embodiments of the present application discloses a text semantic encoding apparatus, including: a word vector matrix generating unit, configured to generate a word vector matrix according to text data; a preprocessing unit, configured to input the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; a convolution processing unit, configured to perform a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and a pooling processing unit, configured to perform a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
  • an apparatus for text semantic encoding including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
  • a machine readable medium having stored thereon instructions, when executed by one or more processors, causes the apparatus to perform the text semantic encoding method as described in the first aspect.
  • the text semantic encoding method and apparatus can process variable-length text data from different data sources to generate a word vector matrix, input the word vector matrix into a bidirectional recurrent neural network for preprocessing, then perform linear convolution and pooling operations on the output of the recurrent neural network, and finally obtain a fixed-length floating-point value vector as the semantic encoding of the variable-length text data for subsequent text mining tasks.
  • the embodiments of the present application can mine the semantic relationships of the text and the association between the text and topics, and realize fixed-length semantic encoding of variable-length text data.
  • FIG. 1 is a schematic diagram of an application scenario according to an embodiment of the present application
  • FIG. 2 is a flowchart of a text semantic coding method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a text semantic encoding method according to another embodiment of the present application.
  • FIG. 4 is a schematic diagram of a text semantic encoding apparatus according to an embodiment of the present application.
  • FIG. 5 is a block diagram of a device for text semantic encoding, according to an exemplary embodiment
  • FIG. 6 is a flowchart of a text semantic coding method according to another embodiment of the present application.
  • FIG. 7 is a schematic diagram of a text semantic encoding apparatus according to another embodiment of the present application.
  • the embodiment of the present application provides a text semantic coding method and device, which can implement text semantic coding of indefinite length text data.
  • text encoding generally refers to a vectorized representation of variable-length natural language text.
  • a piece of natural language text of uncertain length can be represented as a fixed-length floating-point value vector through text encoding.
  • FIG. 1 is an exemplary application scenario of an embodiment of the present application.
  • the method provided by the embodiment of the present application can be applied to the scenario shown in FIG. 1 to implement semantic coding of text.
  • the embodiment of the present application can also be applied to other scenarios, and is not limited herein.
  • text data may be collected by the electronic device 100, such as variable-length text 1, variable-length text 2, variable-length text 3 and variable-length text 4; the lengths of these text data are not necessarily the same.
  • after the collected text data are sent to the text semantic encoding apparatus 400, a fixed-length semantic encoding is generated through word segmentation, word vector matrix generation, bidirectional recurrent neural network preprocessing, convolution, and pooling operations.
  • the lengths of the text semantic codes 1, 2, 3 and 4 are all the same, so that the conversion of variable-length text data into fixed-length text semantic encodings is realized, and the topic reflected by the text can be characterized by the text semantic encoding, providing a basis for subsequent data mining.
  • FIG. 2 is a flowchart of a text semantic coding method according to an embodiment of the present application. As shown in FIG. 2, it may include:
  • S201 Generate a word vector matrix according to text data, which may further comprise the following steps:
  • text data of different data sources can be collected as the text data.
  • taking a question-and-answer system as an example, the question input by the user can be used as text data.
  • for example, the question input by the user is: “How is this function used?”
  • the feedback of customer service in the question-and-answer system can also be collected as text data; for example, the text of the customer service feedback is: “The operation steps of the product sharing function are: log in to the Taobao account, open the product page, click the share button, select an Alipay friend, and click the send button to complete product sharing.”
  • other text data can also be collected as the text data, which is not limited here.
  • the text data is indefinite length text data. That is to say, the length of the text data is not fixed and can be any natural language text.
  • S201B Perform word segmentation on the text data to obtain a sequence of words.
  • the obtained sequence of words can be expressed as [w_1, ..., w_i, ..., w_|s|],
  • where w_i represents the i-th word after segmentation of the input text
  • and |s| represents the length of the text after word segmentation.
  • for example, the text data "how to use this function" can be expressed as [this, function, how, use, ah] after word segmentation.
  • the length of the word sequence is 5, which means that it consists of 5 words.
  • S201C Determine a word vector corresponding to each word in the sequence of words, and generate a word vector matrix.
  • for the above word sequence, the word vector matrix can be obtained by encoding with word embeddings: [v_1, ..., v_i, ..., v_|s|], where the word vector corresponding to the i-th word is v_i = LT_W(w_i).
  • W ∈ R^{d×|v|} represents the pre-trained word embedding matrix,
  • |v| represents the number of words in the word vector matrix,
  • d represents the word embedding code length,
  • R represents the real-number space,
  • and LT represents the lookup-table function.
  • each column of the matrix represents the word embedding code of a word.
  • on this basis, any text can be represented as a matrix S of size d × |s|, where S denotes the matrix formed by the word vectors of the words in the input text.
  • word embedding is a natural language processing encoding technique that can generate a word vector matrix of size |v| × d; each column of this matrix represents a word, for example "how",
  • and that column vector represents the encoding of the word "how"; |v| represents the number of words in the dictionary, and d represents the length of the encoding vector.
  • for a sentence such as "How is this function used?", it is first segmented, and then the corresponding encoding vector is looked up for each word.
  • for example, the vector corresponding to "this" is [-0.01, 0.03, 0.02, ..., 0.06]; these five words each have their own vector expression, and the five vectors combined together form a matrix representing the sentence.
  • the inputting the word vector matrix into the bidirectional recurrent neural network for a preprocessing operation and obtaining an output vector representing the contextual semantic relationships of the words comprises: inputting the word vector matrix into a bidirectional recurrent neural network and performing computation with the long short-term memory (LSTM) operator; obtaining the semantic dependency between each word vector and the preceding context through forward processing, and the semantic dependency between each word vector and the following context through backward processing; and taking the semantic dependencies of each word vector with the preceding and following context as the output vector.
  • for the word vector matrix generated in S201, a bidirectional recurrent neural network may be used for preprocessing.
  • the computing unit of the network uses the LSTM (Long Short-Term Memory) operator.
  • the bidirectional recurrent neural network includes a forward process (processing order w_1 → w_|S|) and a backward process (processing order w_|S| → w_1).
  • for each input vector v_i, the forward process generates an output vector →h_i,
  • and the corresponding backward process likewise generates an output vector ←h_i.
  • these vectors contain the semantic information of the corresponding word w_i and its preceding context (for the forward process) or following context (for the backward process). Then, the following formula is applied: h_i = [→h_i ; ←h_i],
  • where h_i is used as the intermediate code of the corresponding w_i.
  • →h_i is the vector generated by the forward process for input word i, representing the semantic dependency between word i and the preceding context;
  • ←h_i is the vector generated by the backward process for input word i, representing the semantic dependency between word i and the following context.
  • the convolution operation of the output vector to obtain the convolution processing result includes:
  • a convolution kernel F ∈ R^{d×m} (m is the size of the convolution window) can be used to perform a linear convolution operation on H ∈ R^{2d×|S|}, obtaining a vector C ∈ R^{|S|-m+1}, where c_i = (H*F)_i = ∑(H_{:, i:i+m-1} · F).
  • the convolution kernel F is related to the subject.
  • the performing a linear convolution operation on the output vector using a convolution kernel comprises: convolving the output vector H using a set of convolution kernels F with the following formula: c_ji = ∑(H_{:, i:i+m-1} · F_j) + b_i, where:
  • c_ji is the result vector of the convolution operation,
  • H is the output vector of the bidirectional recurrent neural network,
  • F_j is the j-th convolution kernel,
  • b_i is the bias value corresponding to the convolution kernel F_j,
  • i and j are integers,
  • m is the convolution window size.
  • in practice, a convolution operation is usually performed on H with a set of convolution kernels F ∈ R^{n×d×m} to obtain a matrix C ∈ R^{n×(|S|-m+1)}.
  • C represents the result vector of the convolution operation.
  • each convolution kernel corresponds to a bias value b i .
  • each convolution kernel is a two-dimensional vector whose size needs to be tuned for different application scenarios, and the values of the vector are obtained through supervised learning.
  • the convolution kernels are generally obtained by neural network training; specifically, the vector corresponding to a convolution kernel can be obtained through supervised learning on training samples.
  • S203B Perform nonlinear transformation processing on the linear convolution operation result to obtain a convolution processing result.
  • to give the encoding nonlinear expressive power, a nonlinear activation function such as softmax or ReLU is usually added on top of the convolutional layer. Taking ReLU as an example, the output result is A ∈ R^{n×(|S|-m+1)}, where a_ij = max(0, c_ij); A represents the result after ReLU processing.
  • a_ij represents an element of A. After the above processing, each a_ij is a value greater than or equal to zero.
  • S204 Perform a pooling operation on the convolution processing result to obtain a fixed length vector as a semantic encoding of the text data, where the semantic encoding is used to represent a theme of the text data.
  • a max pooling operation is performed on the convolution processing result to eliminate its variable length, and a fixed-length floating-point value vector is obtained as the semantic encoding of the text data; each value of the vector is used to indicate the degree to which the text reflects a topic.
  • the matrix A obtained in S203 is processed by a maximum pooling operation.
  • the pooling operation plays the role of eliminating the "variable length".
  • each row of the matrix A corresponds to a floating-point value vector obtained by convolving with one convolution kernel; the maximum value in that vector is taken, as shown in the formula p_i = max(A_{i,:}).
  • the final result P ⁇ R n is the final encoding of the target text.
  • each bit on the result vector P represents a "subject", and the value on this bit represents the degree of reflection on the "subject”.
  • semantic encoding After obtaining the semantic encoding corresponding to the text data, different processing can be performed on the semantic encoding. For example, since the acquired text is semantically encoded as a floating-point value vector, a common operation for the vector can be used for subsequent processing. For example, the cosine distance of the two codes can be calculated to represent the similarity of the two pieces of text.
  • the present application does not limit the subsequent processing of the text semantic encoding.
  • FIG. 3 is a schematic diagram of a text semantic coding method according to an embodiment of the present application.
  • the target text "How to use this function", after word segmentation, it can be expressed as [this, function, how, use, ah].
  • the word vector is encoded for each participle, and the word vector matrix is input into the bidirectional cyclic neural network for processing to obtain the output result; the output result is subjected to linear convolution processing, nonlinear transformation processing, and the maximum pooling operation is used to eliminate "variable length”. Finally, a fixed length vector is obtained as the semantic encoding of the text.
  • the text data of variable length can be processed: it is first represented as a word vector matrix, and then a bidirectional recurrent neural network and convolution-related operations are used to obtain a fixed-length floating-point value code, which is used as the semantic encoding of the text.
  • this realizes the conversion of variable-length text data into fixed-length text semantic encodings, and mines the semantic relationships of the text as well as its topic expression.
  • FIG. 6 is a flowchart of a text semantic coding method according to another embodiment of the present application.
  • generating a word vector matrix according to the text data may include:
  • S601A obtain text data.
  • the text data is specifically indefinite length text data.
  • the specific implementation can be implemented by referring to S201A shown in FIG. 2.
  • S601B performing word segmentation on the text data to obtain a sequence of words.
  • the specific implementation can be implemented by referring to S201B shown in FIG. 2 .
  • S601C Determine a word vector corresponding to each word in the sequence of words, and generate a word vector matrix.
  • the specific implementation can be implemented by referring to S201C shown in FIG. 2 .
  • obtaining an output vector representing the contextual semantic relationships of the words may include: inputting the word vector matrix into a bidirectional recurrent neural network to perform a preprocessing operation, and obtaining the output vector representing the contextual semantic relationships of the words.
  • the word vector matrix can be input into the bidirectional recurrent neural network and computed with the long short-term memory (LSTM) operator; the semantic dependency between each word vector and the preceding context is obtained through forward processing, the semantic dependency between each word vector and the following context is obtained through backward processing, and the semantic dependencies of each word vector with the preceding and following context are used as the output vector.
  • the output vector can also be obtained in other ways, which is not limited herein.
  • the output vector may be linearly convoluted by using a convolution kernel; the convolution kernel is related to the subject; and the linear convolution operation result is subjected to nonlinear transformation processing to obtain a convolution processing result.
  • a maximum pooling operation may be performed on the convolution processing result to eliminate the lengthening of the result, and a fixed-length floating-point value vector is obtained as a semantic encoding of the text data; wherein each of the vectors The value is used to indicate how much the text reflects the subject.
  • FIG. 4 is a schematic diagram of a text semantic encoding apparatus according to an embodiment of the present application.
  • a text semantic encoding device 400 includes:
  • the word vector matrix generating unit 401 is configured to generate a word vector matrix according to the text data.
  • the specific implementation of the word vector matrix generating unit 401 can be implemented by referring to S201 in the embodiment shown in FIG. 2 .
  • the pre-processing unit 402 is configured to input the word vector matrix into the bidirectional recurrent neural network to perform a pre-processing operation, and obtain an output vector representing the contextual semantic relationships of the words.
  • the specific implementation of the pre-processing unit 402 can be implemented by referring to S202 in the embodiment shown in FIG. 2 .
  • the convolution processing unit 403 is configured to perform a convolution operation on the output vector to obtain a convolution processing result; the convolution processing result is related to a topic; the specific implementation of the convolution processing unit 403 can be implemented by referring to S203 in the embodiment shown in FIG. 2.
  • the pooling processing unit 404 is configured to perform a pooling operation on the convolution processing result to obtain a fixed length vector as a semantic encoding of the text data, where the semantic encoding is used to represent a theme of the text data.
  • the specific implementation of the pooling processing unit 404 can be implemented by referring to S204 in the embodiment shown in FIG. 2 .
  • the word vector matrix generating unit 401 may specifically include: an acquiring unit, configured to acquire text data.
  • the specific implementation of the obtaining unit may be implemented by referring to S201A in the embodiment shown in FIG. 2 .
  • a word segmentation unit configured to perform word segmentation on the text data to obtain a word sequence.
  • the specific implementation of the word segmentation unit can be implemented by referring to S201B in the embodiment shown in FIG. 2 .
  • a matrix generating unit configured to determine a word vector corresponding to each word in the sequence of words, and generate a word vector matrix.
  • the specific implementation of the matrix generating unit may be implemented by referring to S201C in the embodiment shown in FIG. 2 .
  • the pre-processing unit is specifically configured to: input the word vector matrix into a bidirectional recurrent neural network, perform calculation using the long short-term memory (LSTM) operator, obtain the semantic dependency between each word vector and the preceding context through forward processing, obtain the semantic dependency between each word vector and the following context through backward processing, and use the semantic dependencies of each word vector with the preceding and following context as the output vector.
  • the convolution processing unit comprises:
  • a convolution unit for performing a linear convolution operation on the output vector using a convolution kernel;
  • the convolution kernel is related to a subject;
  • a nonlinear transform unit configured to perform nonlinear transform processing on the linear convolution operation result to obtain a convolution processing result.
  • the convolution unit is specifically configured to: perform a convolution operation on the output vector H by using a set of convolution kernels F:
  • c ji is the result vector of the convolution operation
  • H is the output vector of the bidirectional recurrent neural network
  • F j is the jth convolution kernel
  • b i is the bias value corresponding to the convolution kernel F j
  • i is an integer
  • m is the convolution window size.
  • the pooling unit is specifically configured to perform a maximum pooling operation process on the convolution processing result to eliminate the lengthening of the result, and obtain a fixed-length floating-point value vector as the semantic encoding of the text data. Wherein each value of the vector is used to indicate how much the text reflects the subject.
  • FIG. 5 is a block diagram of an apparatus for text semantic encoding provided by another embodiment of the present application.
  • the processor 501 is configured to execute executable modules, such as computer programs, stored in the memory 502.
  • the memory 502 may include a high speed random access memory (RAM), and may also include a non-volatile memory such as at least one disk memory.
  • One or more programs are stored in the memory and configured to be executed by one or more processors 501.
  • the one or more programs include instructions for: generating a word vector matrix based on the text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to represent a topic of the text data.
  • the processor 501 is specifically configured to execute the one or more programs including instructions for: inputting the word vector matrix into a bidirectional recurrent neural network, performing computation with the long short-term memory (LSTM) operator, obtaining the semantic dependency between each word vector and the preceding context through forward processing, obtaining the semantic dependency between each word vector and the following context through backward processing, and using the semantic dependencies of each word vector with the preceding and following context as the output vector.
  • the processor 501 is specifically configured to execute the one or more programs including instructions for: performing a linear convolution operation on the output vector using a convolution kernel, the convolution kernel being related to a topic; and performing nonlinear transformation processing on the result of the linear convolution operation to obtain a convolution processing result.
  • the processor 501 is specifically configured to execute the one or more programs including instructions for: performing a max pooling operation on the convolution processing result to eliminate its variable length, and obtaining a fixed-length floating-point value vector as the semantic encoding of the text data, wherein each value of the vector is used to indicate the degree to which the text reflects a topic.
  • non-transitory computer readable storage medium comprising instructions, such as a memory comprising instructions executable by a processor of the apparatus to perform the above method.
  • the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device.
  • a machine readable medium, for example a non-transitory computer readable storage medium: when the instructions in the medium are executed by a processor of a device (a terminal or a server), the device is enabled to perform a text semantic encoding method, the method comprising: generating a word vector matrix according to the text data; inputting the word vector matrix into a bidirectional recurrent neural network to perform a preprocessing operation, and obtaining an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to a topic; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
  • FIG. 7 is a schematic diagram of a text semantic encoding apparatus according to another embodiment of the present application.
  • a text semantic encoding device 700 includes:
  • the word vector matrix generating unit 701 is configured to generate a word vector matrix according to the text data.
  • the specific implementation of the word vector matrix generating unit 701 can be implemented by referring to S601 in the embodiment shown in FIG. 6.
  • the output vector obtaining unit 702 is configured to obtain an output vector for representing a semantic relationship of the word context according to the word vector matrix.
  • the specific implementation of the output vector obtaining unit 702 can be implemented by referring to S602 in the embodiment shown in FIG. 6.
  • the convolution processing unit 703 is configured to obtain a convolution processing result related to the topic according to the output vector.
  • the specific implementation of the convolution processing unit 703 can be implemented by referring to S603 in the embodiment shown in FIG. 6.
  • the semantic encoding obtaining unit 704 is configured to obtain, according to the convolution processing result, a vector of a fixed length as a semantic encoding of the text data for characterizing a theme of the text data.
  • the specific implementation of the semantic coding obtaining unit 704 can be implemented by referring to S604 in the embodiment shown in FIG. 6.
  • each unit or module of the device of the present application can be implemented by referring to the methods shown in FIG. 2, FIG. 3 and FIG. 6, and details are not described herein.
  • the application can be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the present application can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.
  • the various embodiments in the specification are described in a progressive manner, and the same or similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
  • the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
  • the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a text semantic encoding method and apparatus. The method includes: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data. Embodiments of the present application can mine the semantic relationships of a text and the association between the text and topics, and realize fixed-length semantic encoding of variable-length text data.

Description

Method and apparatus for text semantic encoding
This application claims priority to Chinese Patent Application No. 201711056845.2, entitled "Method and apparatus for text semantic encoding", filed on October 27, 2017, the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present application relate to the field of computer technologies, and in particular, to a text semantic encoding method and apparatus.
Background
In many application scenarios, a user question-and-answer service needs to be provided. For example, Internet applications offer consulting services about their features to help users better understand their product functions. In these question-and-answer services, users and customer service staff communicate mainly in natural language text. As the number of users of an application or service grows, the pressure on customer service grows with it. Therefore, many service providers choose to use technologies such as text mining or information retrieval to provide users with automatic question-and-answer services instead of costly and poorly scalable human resources.
In order to mine and process the natural language text involved in question answering, the text must be numerically encoded, that is, subjected to text encoding. At present, one method encodes variable-length text with the bag-of-words technique. Each variable-length text is represented by an integer-valued vector of length V, where V is the size of the dictionary; each position of the vector corresponds to a word, and its value is the number of occurrences of that word in the text. However, this encoding method only uses the word-frequency information in the text and ignores the contextual dependencies between words, so it is difficult to fully express the semantic information contained in the text. In addition, the bag-of-words encoding length equals the size of the entire dictionary (usually on the order of hundreds of thousands), and the vast majority of the encoded values are 0. The sparsity of the encoding is unfavorable for subsequent text mining, and the excessive encoding length also greatly reduces the speed of subsequent text processing.
To solve the problems of the bag-of-words encoding scheme, a word embedding technique emerged for encoding text. This method uses a fixed-length floating-point vector to express the semantics of text. Word embedding is a compressed data representation; specifically, a fixed-length (usually around 100-dimensional) floating-point vector is used to express text semantics. Compared with bag-of-words encoding, the dimensionality is greatly reduced, which effectively solves the data sparsity problem and can greatly increase the speed of subsequent text processing. However, word embedding encoding usually requires pre-training, that is, it must be determined during offline training which texts are to be encoded. Therefore, this algorithm is usually used to encode short texts such as words or phrases, which can be exhaustively enumerated. Text at the sentence and paragraph level, however, is variable-length sequence data; such variable-length sequences cannot be enumerated, so their encodings cannot be obtained through pre-training. The text encoding methods provided by the prior art therefore have the defect that variable-length text data cannot be accurately encoded.
Summary of the invention
Embodiments of the present application provide a text semantic encoding method and apparatus, aiming to solve the technical problem in the prior art that variable-length text data cannot be accurately encoded.
To this end, the embodiments of the present application provide the following technical solutions:
A first aspect of the embodiments of the present application discloses a text semantic encoding method, including: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
A second aspect of the embodiments of the present application discloses a text semantic encoding apparatus, including: a word vector matrix generating unit, configured to generate a word vector matrix according to text data; a preprocessing unit, configured to input the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; a convolution processing unit, configured to perform a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and a pooling processing unit, configured to perform a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
A third aspect of the embodiments of the present application discloses an apparatus for text semantic encoding, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
A fourth aspect of the embodiments of the present application discloses a machine readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the text semantic encoding method according to the first aspect.
The text semantic encoding method and apparatus provided by the embodiments of the present application can process variable-length text data from different data sources to generate a word vector matrix, input the word vector matrix into a bidirectional recurrent neural network for preprocessing, and then perform linear convolution and pooling operations on the output of the recurrent neural network, finally obtaining a fixed-length floating-point vector as the semantic encoding of the variable-length text data for subsequent text mining tasks. Embodiments of the present application can mine the semantic relationships of a text and the association between the text and topics, and realize fixed-length semantic encoding of variable-length text data.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present application, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a schematic diagram of an application scenario according to an embodiment of the present application;
FIG. 2 is a flowchart of a text semantic encoding method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text semantic encoding method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a text semantic encoding apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of a device for text semantic encoding according to an exemplary embodiment;
FIG. 6 is a flowchart of a text semantic encoding method according to yet another embodiment of the present application;
FIG. 7 is a schematic diagram of a text semantic encoding apparatus according to yet another embodiment of the present application.
Detailed description
Embodiments of the present application provide a text semantic encoding method and apparatus that can realize text semantic encoding of variable-length text data.
The terms used in the embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The technical term "text encoding" generally refers to a vectorized representation of variable-length natural language text. In the embodiments of the present application, a piece of natural language text of uncertain length can be represented, through text encoding, as a fixed-length floating-point value vector.
Of course, the above explanation of terms is provided only for ease of understanding and has no limiting meaning.
Referring to FIG. 1, which shows an exemplary application scenario of an embodiment of the present application. The method provided by the embodiments of the present application can be applied to the scenario shown in FIG. 1 to realize semantic encoding of text. Of course, the embodiments of the present application can also be applied to other scenarios, which are not limited here. As shown in FIG. 1, in an exemplary application scenario of the present application, text data can be collected by an electronic device 100, for example variable-length text 1, variable-length text 2, variable-length text 3 and variable-length text 4, whose lengths are not necessarily the same. After the collected text data are sent to a text semantic encoding apparatus 400, a fixed-length semantic encoding is generated through word segmentation, word vector matrix generation, bidirectional recurrent neural network preprocessing, convolution, and pooling operations. The text semantic encodings 1, 2, 3 and 4 all have the same length, so that the conversion from variable-length text data to fixed-length text semantic encodings is realized, and the topic reflected by a text can be characterized by its semantic encoding, providing a basis for subsequent data mining.
It should be noted that the above application scenario is shown only to facilitate understanding of the present application, and the implementations of the present application are not limited in this respect; rather, they can be applied to any applicable scenario.
The text semantic encoding methods shown in the exemplary embodiments of the present application are described below with reference to FIG. 2, FIG. 3 and FIG. 6.
Referring to FIG. 2, which is a flowchart of a text semantic encoding method according to an embodiment of the present application. As shown in FIG. 2, the method may include:
S201: Generate a word vector matrix according to text data.
S201 may in turn include the following steps:
S201A: Obtain text data.
In specific implementations, text data from different data sources can be collected as the text data. Taking a question-and-answer system as an example, a question input by a user can be used as the text data; for example, the question input by the user is: "这个功能怎么使用啊?" ("How is this function used?"). Of course, feedback from customer service in the question-and-answer system can also be collected as text data; for example, the feedback text from customer service is: "The operation steps of the product sharing function are: log in to the Taobao account, open the product page, click the share button, select an Alipay friend, and click the send button to complete product sharing." Of course, other text data can also be collected as the text data, which is not limited here.
The text data is variable-length text data. That is, the length of the text data is not fixed, and it can be any natural language text.
S201B: Perform word segmentation on the text data to obtain a word sequence.
Word segmentation is performed on the input text, and the obtained word sequence can be expressed as:
[w_1, ..., w_i, ..., w_|s|]
where w_i denotes the i-th word after segmentation of the input text, and |s| denotes the length of the segmented text. For example, after word segmentation the text data "这个功能怎么使用啊" can be expressed as [这个, 功能, 怎么, 使用, 啊]; the length of the word sequence is 5, meaning that it consists of 5 words.
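As an illustration of step S201B, the following is a minimal Python sketch; it assumes the open-source jieba tokenizer purely as an example, since the embodiment does not prescribe any particular word-segmentation tool, and the exact segmentation output may vary.

```python
# Minimal sketch of step S201B: segmenting the input text into a word sequence.
# jieba is used here only as an illustrative tokenizer; the embodiment does not
# prescribe a specific word-segmentation tool.
import jieba

text = "这个功能怎么使用啊"
words = jieba.lcut(text)      # e.g. ['这个', '功能', '怎么', '使用', '啊']
print(words, len(words))      # |s| = number of words after segmentation
```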
S201C: Determine the word vector corresponding to each word in the word sequence, and generate a word vector matrix.
For the above word sequence, encoding with word embeddings yields the word vector matrix:
[v_1, ..., v_i, ..., v_|s|]
where the word vector corresponding to the i-th word is v_i = LT_W(w_i).
W ∈ R^{d×|v|} denotes the pre-trained word embedding matrix, |v| denotes the number of words in the word vector matrix, d denotes the word embedding code length, R denotes the real-number space, and LT denotes the lookup-table function. Each column of this matrix represents the word embedding code of one word. On this basis, any text can be represented as a d × |s| matrix S, where S denotes the matrix formed by the word vectors corresponding to the words in the input text.
It should be noted that word embedding is a natural language processing encoding technique that can generate a word vector matrix of size |v| × d. Each column of this matrix represents a word, for example "怎么" ("how"); that column vector represents the encoding of the word "怎么", |v| represents the number of words in the dictionary, and d represents the length of the encoding vector. For a sentence such as "这个功能怎么使用啊", it is first segmented into "这个 功能 怎么 使用 啊", and then the corresponding encoding vector is looked up for each word; for example, the vector corresponding to "这个" is [-0.01, 0.03, 0.02, ..., 0.06]. Each of the five words has its own vector expression, and the five vectors combined together form a matrix representing this sentence.
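The lookup-table step can be illustrated with a small sketch; the embedding matrix W, the vocabulary mapping and all sizes below are illustrative placeholders rather than values prescribed by the embodiment.

```python
# Sketch of step S201C: looking up a pre-trained word vector for each word and
# stacking them into the d x |s| matrix S. `W` and `vocab` stand in for an
# offline word-embedding training step (names and values are illustrative).
import numpy as np

d, V = 100, 50000                                  # embedding length and dictionary size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V)).astype(np.float32)     # placeholder for a trained W ∈ R^{d×|v|}
vocab = {"这个": 0, "功能": 1, "怎么": 2, "使用": 3, "啊": 4}  # toy word-to-index table

def lookup_table(words):
    """LT_W: map each word w_i to its column of W, giving S ∈ R^{d×|s|}."""
    idx = [vocab[w] for w in words]
    return W[:, idx]

S = lookup_table(["这个", "功能", "怎么", "使用", "啊"])
print(S.shape)                                     # (100, 5), i.e. d x |s|
```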
S202: Input the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words.
In some implementations, inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words includes: inputting the word vector matrix into a bidirectional recurrent neural network and performing computation with the long short-term memory (LSTM) operator; obtaining the semantic dependency between each word vector and the preceding context through forward processing, and the semantic dependency between each word vector and the following context through backward processing; and taking the semantic dependencies of each word vector with the preceding and following context as the output vector.
For example, for the word vector matrix S generated in S201, a bidirectional recurrent neural network may be used for preprocessing. The computing units of the network use the LSTM (Long Short-Term Memory) operator. The bidirectional recurrent neural network includes a forward process (processing order w_1 → w_|S|) and a backward process (processing order w_|S| → w_1). For each input vector v_i, the forward process generates an output vector →h_i, and the corresponding backward process likewise generates an output vector ←h_i. These vectors contain the semantic information of the corresponding word w_i and its preceding context (for the forward process) or following context (for the backward process). Then, the following formula is applied:
h_i = [→h_i ; ←h_i]
where h_i serves as the intermediate code of the corresponding w_i; →h_i is the vector generated by the forward process of the bidirectional recurrent neural network for input word i, representing the semantic dependency between word i and the preceding context; ←h_i is the vector generated by the backward process of the bidirectional recurrent neural network for input word i, representing the semantic dependency between word i and the following context.
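A minimal sketch of this preprocessing step, assuming PyTorch's nn.LSTM with bidirectional=True as one possible realization of the forward and backward passes; the hidden size is chosen only so that the output matches the H ∈ R^{2d×|S|} notation above.

```python
# Sketch of step S202: a bidirectional LSTM over the word-vector sequence.
# nn.LSTM with bidirectional=True returns, for every position i, the
# concatenation [→h_i ; ←h_i], which serves as the intermediate code h_i.
import torch
import torch.nn as nn

d, hidden = 100, 100                  # hidden size chosen so that H ∈ R^{2d×|S|}
bilstm = nn.LSTM(input_size=d, hidden_size=hidden, bidirectional=True, batch_first=True)

S = torch.randn(1, 5, d)              # one sentence of |s| = 5 word vectors (batch, |s|, d)
H, _ = bilstm(S)                      # (1, 5, 2*hidden): forward/backward concatenation
H = H.squeeze(0).t()                  # reshape to (2d, |S|) to match H ∈ R^{2d×|S|}
print(H.shape)                        # torch.Size([200, 5])
```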
S203: Perform a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics.
Performing a convolution operation on the output vector to obtain a convolution processing result includes:
S203A: Perform a linear convolution operation on the output vector using convolution kernels, the convolution kernels being related to topics.
In a specific implementation, a convolution kernel F ∈ R^{d×m} (m is the size of the convolution window) can be used to perform a linear convolution operation on H ∈ R^{2d×|S|}, obtaining a vector C ∈ R^{|S|-m+1}, where:
c_i = (H*F)_i = ∑(H_{:, i:i+m-1} · F)
The convolution kernel F is related to topics.
In some implementations, performing a linear convolution operation on the output vector using convolution kernels includes: using a set of convolution kernels F, performing a convolution operation on the output vector H with the following formula:
c_ji = ∑(H_{:, i:i+m-1} · F_j) + b_i
where c_ji is the result vector of the convolution operation, H is the output vector of the bidirectional recurrent neural network, F_j is the j-th convolution kernel, b_i is the bias value corresponding to the convolution kernel F_j, i is an integer, j is an integer, and m is the convolution window size.
In practical applications, a convolution operation is usually performed on H with a set of convolution kernels F ∈ R^{n×d×m}, obtaining a matrix C ∈ R^{n×(|S|-m+1)}, where C denotes the result of the convolution operation. In addition, each convolution kernel corresponds to a bias value b_i.
In a specific implementation, when determining the convolution kernels to be used, the size of each kernel needs to be determined. Generally, each convolution kernel is a two-dimensional vector whose size needs to be tuned for different application scenarios, and the values of the vector are obtained through supervised learning. The convolution kernels are generally obtained by neural network training; specifically, the vector corresponding to a convolution kernel can be obtained through supervised learning on training samples.
S203B: Perform a nonlinear transformation on the result of the linear convolution operation to obtain the convolution processing result.
To give the encoding nonlinear expressive power, a nonlinear activation function such as softmax or ReLU is usually added on top of the convolutional layer. Taking ReLU as an example, the output result is A ∈ R^{n×(|S|-m+1)}, where:
a_ij = max(0, c_ij)
where A denotes the result after ReLU processing, and a_ij denotes an element of A. After the above processing, every a_ij is a value greater than or equal to 0.
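The linear convolution plus ReLU step can be sketched as follows, assuming the kernels span all 2d rows of H and are realized with a one-dimensional convolution; the number of kernels n and the window size m are illustrative choices, not prescribed values.

```python
# Sketch of step S203: topic-related convolution kernels plus a ReLU nonlinearity.
# Each of the n kernels slides over H ∈ R^{2d×|S|}; nn.Conv1d with in_channels=2d,
# out_channels=n, kernel_size=m computes the windowed sums ∑(H[:, i:i+m] · F_j)
# plus a per-kernel bias, giving C ∈ R^{n×(|S|-m+1)}.
import torch
import torch.nn as nn

two_d, n, m = 200, 64, 3              # 2d, number of kernels ("topics"), window size
conv = nn.Conv1d(in_channels=two_d, out_channels=n, kernel_size=m)

H = torch.randn(1, two_d, 5)          # (batch, 2d, |S|) with |S| = 5
C = conv(H)                           # (1, n, |S| - m + 1)
A = torch.relu(C)                     # a_ij = max(0, c_ij)
print(A.shape)                        # torch.Size([1, 64, 3])
```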
S204: Perform a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
It should be noted that in this step, a max pooling operation is performed on the convolution processing result to eliminate its variable length, and a fixed-length floating-point vector is obtained as the semantic encoding of the text data, where each value of the vector indicates the degree to which the text reflects a topic.
Specifically, the matrix A obtained in S203 is processed with a max pooling operation. In text encoding, the pooling operation plays the role of eliminating the "variable length". Specifically, for the input matrix A, each row of A corresponds to a floating-point vector obtained by convolving with one convolution kernel, and the maximum value in that vector is taken, as shown in the following formula:
p_i = max(A_{i,:})
The final result P ∈ R^n is the final encoding of the target text.
It should be noted that each position of the result vector P represents a "topic", and the value at that position represents the degree to which the text reflects that "topic".
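A short sketch of the max pooling step, continuing the illustrative shapes used above.

```python
# Sketch of step S204: max pooling over time removes the dependence on |S|.
# Taking the maximum of each row of A yields p_i = max(A_i,:), so the final code
# P ∈ R^n has one value per convolution kernel ("topic") regardless of text length.
import torch

A = torch.relu(torch.randn(64, 3))    # A ∈ R^{n×(|S|-m+1)} from the previous step
P = A.max(dim=1).values               # p_i = maximum of the i-th row
print(P.shape)                        # torch.Size([64]) -- fixed length n
```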
After the semantic encoding corresponding to the text data is obtained, different processing can be applied to it. For example, since the obtained text semantic encoding is a floating-point vector, common vector operations can be used for subsequent processing; for instance, the cosine distance between two encodings can be computed to represent the similarity between the two texts. Of course, after the semantic encoding of the text data is obtained, the present application does not limit the subsequent processing of the text semantic encoding.
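One possible downstream use mentioned above, cosine similarity between two semantic codes, can be sketched as follows (illustrative only).

```python
# Sketch of one downstream use: cosine similarity between two fixed-length
# semantic codes P1 and P2 as a measure of how similar the two texts are.
import torch
import torch.nn.functional as F

P1 = torch.randn(64)
P2 = torch.randn(64)
similarity = F.cosine_similarity(P1, P2, dim=0)   # value in [-1, 1]
print(float(similarity))
```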
Referring to FIG. 3, which is a schematic diagram of a text semantic encoding method according to an embodiment of the present application. As shown in FIG. 3, the target text "这个功能怎么使用啊" can be expressed as [这个, 功能, 怎么, 使用, 啊] after word segmentation. Each segmented word is encoded with a word vector, and the word vector matrix is input into a bidirectional recurrent neural network for processing to obtain an output result; the output result undergoes linear convolution, nonlinear transformation, and a max pooling operation to eliminate the "variable length", and finally a fixed-length vector is obtained as the semantic encoding of the text. In the embodiments of the present application, variable-length text data can be processed: it is first represented as a word vector matrix, and then a bidirectional recurrent neural network and convolution-related operations are used to obtain a fixed-length floating-point code serving as the semantic encoding of the text, realizing the conversion from variable-length text data to fixed-length text semantic encodings, and mining the semantic relationships of the text as well as its topic expression.
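Putting these steps together, the following is a compact, non-authoritative sketch of the whole encoder under the same illustrative assumptions (embedding lookup, bidirectional LSTM, topic-related convolution, ReLU, max pooling); layer sizes and the Embedding layer are arbitrary placeholders rather than the prescribed implementation.

```python
# End-to-end sketch of the encoder described above:
# embedding lookup -> bidirectional LSTM -> topic convolution -> ReLU -> max pooling.
import torch
import torch.nn as nn

class TextSemanticEncoder(nn.Module):
    def __init__(self, vocab_size=50000, d=100, n_topics=64, m=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)              # word-vector lookup
        self.bilstm = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * d, n_topics, kernel_size=m) # topic-related kernels

    def forward(self, word_ids):                              # word_ids: (batch, |s|)
        S = self.embed(word_ids)                              # (batch, |s|, d)
        H, _ = self.bilstm(S)                                 # (batch, |s|, 2d)
        A = torch.relu(self.conv(H.transpose(1, 2)))          # (batch, n, |s|-m+1)
        return A.max(dim=2).values                            # fixed-length code P ∈ R^n

encoder = TextSemanticEncoder()
codes = encoder(torch.randint(0, 50000, (2, 7)))              # two texts of 7 words each
print(codes.shape)                                            # torch.Size([2, 64])
```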
Referring to FIG. 6, which is a flowchart of a text semantic encoding method according to yet another embodiment of the present application.
S601: Generate a word vector matrix according to text data.
Generating a word vector matrix according to the text data may include:
S601A: Obtain text data, where the text data is specifically variable-length text data. The specific implementation can refer to S201A shown in FIG. 2.
S601B: Perform word segmentation on the text data to obtain a word sequence. The specific implementation can refer to S201B shown in FIG. 2.
S601C: Determine the word vector corresponding to each word in the word sequence, and generate a word vector matrix. The specific implementation can refer to S201C shown in FIG. 2.
S602: According to the word vector matrix, obtain an output vector representing the contextual semantic relationships of the words.
In a specific implementation, obtaining an output vector representing the contextual semantic relationships of the words according to the word vector matrix may include: inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain the output vector representing the contextual semantic relationships of the words. Further, the word vector matrix may be input into a bidirectional recurrent neural network and computed with the long short-term memory (LSTM) operator; the semantic dependency between each word vector and the preceding context is obtained through forward processing, the semantic dependency between each word vector and the following context is obtained through backward processing, and the semantic dependencies of each word vector with the preceding and following context are taken as the output vector. Of course, the output vector can also be obtained in other ways, which is not limited here.
S603: According to the output vector, obtain a topic-related convolution processing result.
In a specific implementation, a linear convolution operation may be performed on the output vector using convolution kernels, the convolution kernels being related to topics; a nonlinear transformation is then performed on the result of the linear convolution operation to obtain the convolution processing result.
S604: According to the convolution processing result, obtain a fixed-length vector as the semantic encoding of the text data, for characterizing the topic of the text data.
In a specific implementation, a max pooling operation may be performed on the convolution processing result to eliminate its variable length, and a fixed-length floating-point vector is obtained as the semantic encoding of the text data, where each value of the vector indicates the degree to which the text reflects a topic.
The apparatuses corresponding to the methods provided by the embodiments of the present application are introduced below.
Referring to FIG. 4, which is a schematic diagram of a text semantic encoding apparatus according to an embodiment of the present application.
A text semantic encoding apparatus 400 includes:
a word vector matrix generating unit 401, configured to generate a word vector matrix according to text data, where the specific implementation of the word vector matrix generating unit 401 can refer to S201 of the embodiment shown in FIG. 2;
a preprocessing unit 402, configured to input the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words, where the specific implementation of the preprocessing unit 402 can refer to S202 of the embodiment shown in FIG. 2;
a convolution processing unit 403, configured to perform a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics, where the specific implementation of the convolution processing unit 403 can refer to S203 of the embodiment shown in FIG. 2; and
a pooling processing unit 404, configured to perform a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data, where the specific implementation of the pooling processing unit 404 can refer to S204 of the embodiment shown in FIG. 2.
In some implementations, the word vector matrix generating unit 401 may specifically include: an obtaining unit, configured to obtain text data, whose specific implementation can refer to S201A of the embodiment shown in FIG. 2; a word segmentation unit, configured to perform word segmentation on the text data to obtain a word sequence, whose specific implementation can refer to S201B of the embodiment shown in FIG. 2; and a matrix generating unit, configured to determine the word vector corresponding to each word in the word sequence and generate a word vector matrix, whose specific implementation can refer to S201C of the embodiment shown in FIG. 2.
In some implementations, the preprocessing unit is specifically configured to: input the word vector matrix into a bidirectional recurrent neural network, perform computation with the long short-term memory (LSTM) operator, obtain the semantic dependency between each word vector and the preceding context through forward processing, obtain the semantic dependency between each word vector and the following context through backward processing, and take the semantic dependencies of each word vector with the preceding and following context as the output vector.
In some implementations, the convolution processing unit includes: a convolution unit, configured to perform a linear convolution operation on the output vector using convolution kernels, the convolution kernels being related to topics; and a nonlinear transformation unit, configured to perform a nonlinear transformation on the result of the linear convolution operation to obtain the convolution processing result.
In some implementations, the convolution unit is specifically configured to: using a set of convolution kernels F, perform a convolution operation on the output vector H with the following formula:
c_ji = ∑(H_{:, i:i+m-1} · F_j) + b_i
where c_ji is the result vector of the convolution operation, H is the output vector of the bidirectional recurrent neural network, F_j is the j-th convolution kernel, b_i is the bias value corresponding to the convolution kernel F_j, i is an integer, j is an integer, and m is the convolution window size.
In some implementations, the pooling unit is specifically configured to perform a max pooling operation on the convolution processing result to eliminate its variable length, and obtain a fixed-length floating-point vector as the semantic encoding of the text data, where each value of the vector indicates the degree to which the text reflects a topic.
Referring to FIG. 5, which is a block diagram of a device for text semantic encoding according to another embodiment of the present application, including: at least one processor 501 (for example, a CPU), a memory 502, and at least one communication bus 503 for implementing connection and communication between these components. The processor 501 is configured to execute executable modules, such as computer programs, stored in the memory 502. The memory 502 may include a high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory. One or more programs are stored in the memory and configured to be executed by the one or more processors 501, and the one or more programs include instructions for: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
In some implementations, the processor 501 is specifically configured to execute the one or more programs including instructions for: inputting the word vector matrix into a bidirectional recurrent neural network, performing computation with the long short-term memory (LSTM) operator, obtaining the semantic dependency between each word vector and the preceding context through forward processing, obtaining the semantic dependency between each word vector and the following context through backward processing, and taking the semantic dependencies of each word vector with the preceding and following context as the output vector.
In some implementations, the processor 501 is specifically configured to execute the one or more programs including instructions for: performing a linear convolution operation on the output vector using convolution kernels, the convolution kernels being related to topics; and performing a nonlinear transformation on the result of the linear convolution operation to obtain the convolution processing result.
In some implementations, the processor 501 is specifically configured to execute the one or more programs including instructions for: performing a max pooling operation on the convolution processing result to eliminate its variable length, and obtaining a fixed-length floating-point vector as the semantic encoding of the text data, where each value of the vector indicates the degree to which the text reflects a topic.
In an exemplary embodiment, a non-transitory computer readable storage medium including instructions is further provided, for example a memory including instructions executable by a processor of the apparatus to perform the above method. For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A machine readable medium, for example a non-transitory computer readable storage medium, is also provided: when the instructions in the medium are executed by a processor of a device (a terminal or a server), the device is enabled to perform a text semantic encoding method, the method including: generating a word vector matrix according to text data; inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing the contextual semantic relationships of the words; performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and performing a pooling operation on the convolution processing result to obtain a fixed-length vector as the semantic encoding of the text data, the semantic encoding being used to characterize the topic of the text data.
Referring to FIG. 7, which is a schematic diagram of a text semantic encoding apparatus according to another embodiment of the present application.
A text semantic encoding apparatus 700 includes:
a word vector matrix generating unit 701, configured to generate a word vector matrix according to text data, whose specific implementation can refer to S601 of the embodiment shown in FIG. 6;
an output vector obtaining unit 702, configured to obtain, according to the word vector matrix, an output vector representing the contextual semantic relationships of the words, whose specific implementation can refer to S602 of the embodiment shown in FIG. 6;
a convolution processing unit 703, configured to obtain a topic-related convolution processing result according to the output vector, whose specific implementation can refer to S603 of the embodiment shown in FIG. 6; and
a semantic encoding obtaining unit 704, configured to obtain, according to the convolution processing result, a fixed-length vector as the semantic encoding of the text data for characterizing the topic of the text data, whose specific implementation can refer to S604 of the embodiment shown in FIG. 6.
The arrangement of the units or modules of the apparatus of the present application can be implemented with reference to the methods shown in FIG. 2, FIG. 3 and FIG. 6, and details are not repeated here.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article or device that includes the element. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiments are basically similar to the method embodiments, their description is relatively simple, and reference may be made to the description of the method embodiments for relevant parts. The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort. The above are only specific implementations of the present application. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims (11)

  1. A text semantic encoding method, comprising:
    generating a word vector matrix according to text data;
    inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing contextual semantic relationships of words;
    performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and
    performing a pooling operation on the convolution processing result to obtain a fixed-length vector as a semantic encoding of the text data, the semantic encoding being used to characterize a topic of the text data.
  2. The method according to claim 1, wherein inputting the word vector matrix into the bidirectional recurrent neural network for the preprocessing operation to obtain the output vector representing the contextual semantic relationships of the words comprises:
    inputting the word vector matrix into the bidirectional recurrent neural network and performing computation with a long short-term memory (LSTM) operator, obtaining the semantic dependency between each word vector and the preceding context through forward processing, obtaining the semantic dependency between each word vector and the following context through backward processing, and taking the semantic dependencies of each word vector with the preceding and following context as the output vector.
  3. The method according to claim 1, wherein performing the convolution operation on the output vector to obtain the convolution processing result comprises:
    performing a linear convolution operation on the output vector using convolution kernels, the convolution kernels being related to topics; and
    performing a nonlinear transformation on a result of the linear convolution operation to obtain the convolution processing result.
  4. The method according to claim 1, wherein performing a max pooling operation on the convolution processing result comprises:
    performing a max pooling operation on the convolution processing result to eliminate its variable length, and obtaining a fixed-length floating-point vector as the semantic encoding of the text data, wherein each value of the vector indicates a degree to which the text reflects a topic.
  5. The method according to claim 1, wherein the text data is variable-length text data.
  6. A text semantic encoding method, comprising:
    generating a word vector matrix according to text data;
    obtaining, according to the word vector matrix, an output vector representing contextual semantic relationships of words;
    obtaining, according to the output vector, a topic-related convolution processing result; and
    obtaining, according to the convolution processing result, a fixed-length vector as a semantic encoding of the text data, for characterizing a topic of the text data.
  7. A text semantic encoding apparatus, comprising:
    a word vector matrix generating unit, configured to generate a word vector matrix according to text data;
    a preprocessing unit, configured to input the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing contextual semantic relationships of words;
    a convolution processing unit, configured to perform a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and
    a pooling processing unit, configured to perform a pooling operation on the convolution processing result to obtain a fixed-length vector as a semantic encoding of the text data, the semantic encoding being used to characterize a topic of the text data.
  8. A text semantic encoding apparatus, comprising:
    a word vector matrix generating unit, configured to generate a word vector matrix according to text data;
    an output vector obtaining unit, configured to obtain, according to the word vector matrix, an output vector representing contextual semantic relationships of words;
    a convolution processing unit, configured to obtain, according to the output vector, a topic-related convolution processing result; and
    a semantic encoding obtaining unit, configured to obtain, according to the convolution processing result, a fixed-length vector as a semantic encoding of the text data, for characterizing a topic of the text data.
  9. An apparatus for text semantic encoding, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for:
    generating a word vector matrix according to text data;
    inputting the word vector matrix into a bidirectional recurrent neural network for a preprocessing operation to obtain an output vector representing contextual semantic relationships of words;
    performing a convolution operation on the output vector to obtain a convolution processing result, the convolution processing result being related to topics; and
    performing a pooling operation on the convolution processing result to obtain a fixed-length vector as a semantic encoding of the text data, the semantic encoding being used to characterize a topic of the text data.
  10. An apparatus for text semantic encoding, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for:
    generating a word vector matrix according to text data;
    obtaining, according to the word vector matrix, an output vector representing contextual semantic relationships of words;
    obtaining, according to the output vector, a topic-related convolution processing result; and
    obtaining, according to the convolution processing result, a fixed-length vector as a semantic encoding of the text data, for characterizing a topic of the text data.
  11. A machine readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the text semantic encoding method according to one or more of claims 1 to 5.
PCT/CN2018/111628 2017-10-27 2018-10-24 一种文本语义编码方法及装置 WO2019080864A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2020520227A JP2021501390A (ja) 2017-10-27 2018-10-24 テキスト意味論的コード化の方法および装置
US16/754,832 US20200250379A1 (en) 2017-10-27 2018-10-24 Method and apparatus for textual semantic encoding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711056845.2 2017-10-27
CN201711056845.2A CN110019793A (zh) 2017-10-27 2017-10-27 一种文本语义编码方法及装置

Publications (1)

Publication Number Publication Date
WO2019080864A1 true WO2019080864A1 (zh) 2019-05-02

Family

ID=66247156

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111628 WO2019080864A1 (zh) 2017-10-27 2018-10-24 一种文本语义编码方法及装置

Country Status (5)

Country Link
US (1) US20200250379A1 (zh)
JP (1) JP2021501390A (zh)
CN (1) CN110019793A (zh)
TW (1) TW201917602A (zh)
WO (1) WO2019080864A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052687A (zh) * 2020-09-02 2020-12-08 厦门市美亚柏科信息股份有限公司 基于深度可分离卷积的语义特征处理方法、装置及介质
CN117521652A (zh) * 2024-01-05 2024-02-06 一站发展(北京)云计算科技有限公司 基于自然语言模型的智能匹配系统及方法

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11250221B2 (en) * 2019-03-14 2022-02-15 Sap Se Learning system for contextual interpretation of Japanese words
CN112396484A (zh) * 2019-08-16 2021-02-23 阿里巴巴集团控股有限公司 商品的验证方法及装置、存储介质和处理器
CN110705268A (zh) * 2019-09-02 2020-01-17 平安科技(深圳)有限公司 基于人工智能的文章主旨提取方法、装置及计算机可读存储介质
CN112579730A (zh) * 2019-09-11 2021-03-30 慧科讯业有限公司 高扩展性、多标签的文本分类方法和装置
CN110826298B (zh) * 2019-11-13 2023-04-04 北京万里红科技有限公司 一种智能辅助定密系统中使用的语句编码方法
CN110889290B (zh) * 2019-11-13 2021-11-16 北京邮电大学 文本编码方法和设备、文本编码有效性检验方法和设备
CN112287672A (zh) * 2019-11-28 2021-01-29 北京京东尚科信息技术有限公司 文本意图识别方法及装置、电子设备、存储介质
US11544946B2 (en) * 2019-12-27 2023-01-03 Robert Bosch Gmbh System and method for enhancing neural sentence classification
CN111160042B (zh) * 2019-12-31 2023-04-28 重庆觉晓科技有限公司 一种文本语义解析方法和装置
CN111259162B (zh) * 2020-01-08 2023-10-03 百度在线网络技术(北京)有限公司 对话交互方法、装置、设备和存储介质
CN112069827B (zh) * 2020-07-30 2022-12-09 国网天津市电力公司 一种基于细粒度主题建模的数据到文本生成方法
CN112232089B (zh) * 2020-12-15 2021-04-06 北京百度网讯科技有限公司 语义表示模型的预训练方法、设备和存储介质
CN112686050B (zh) * 2020-12-27 2023-12-05 北京明朝万达科技股份有限公司 基于潜在语义索引的上网行为分析方法、系统和介质
CN112800183B (zh) * 2021-02-25 2023-09-26 国网河北省电力有限公司电力科学研究院 内容名称数据处理方法及终端设备
CN113110843B (zh) * 2021-03-05 2023-04-11 卓尔智联(武汉)研究院有限公司 合约生成模型训练方法、合约生成方法及电子设备
CN113033150A (zh) * 2021-03-18 2021-06-25 深圳市元征科技股份有限公司 一种程序文本的编码处理方法、装置以及存储介质
CN115713079A (zh) * 2021-08-18 2023-02-24 北京京东方技术开发有限公司 用于自然语言处理、训练自然语言处理模型的方法及设备
CN113724882A (zh) * 2021-08-30 2021-11-30 康键信息技术(深圳)有限公司 基于问诊会话构建用户画像的方法、装置、设备和介质
CN115146488B (zh) * 2022-09-05 2022-11-22 山东鼹鼠人才知果数据科技有限公司 基于大数据的可变业务流程智能建模系统及其方法
CN116663568B (zh) * 2023-07-31 2023-11-17 腾云创威信息科技(威海)有限公司 基于优先级的关键任务识别系统及其方法
CN117574922A (zh) * 2023-11-29 2024-02-20 西南石油大学 一种基于多通道模型的口语理解联合方法及口语理解系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN106547885A (zh) * 2016-10-27 2017-03-29 桂林电子科技大学 一种文本分类系统及方法
CN106980683A (zh) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 基于深度学习的博客文本摘要生成方法
CN107169035A (zh) * 2017-04-19 2017-09-15 华南理工大学 一种混合长短期记忆网络和卷积神经网络的文本分类方法
CN107229684A (zh) * 2017-05-11 2017-10-03 合肥美的智能科技有限公司 语句分类方法、系统、电子设备、冰箱及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7859036B2 (en) * 2007-04-05 2010-12-28 Micron Technology, Inc. Memory devices having electrodes comprising nanowires, systems including same and methods of forming same
CN101727500A (zh) * 2010-01-15 2010-06-09 清华大学 一种基于流聚类的中文网页文本分类方法
US10445356B1 (en) * 2016-06-24 2019-10-15 Pulselight Holdings, Inc. Method and system for analyzing entities
CN106407903A (zh) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 基于多尺度卷积神经网络的实时人体异常行为识别方法
US10643120B2 (en) * 2016-11-15 2020-05-05 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
US20180260414A1 (en) * 2017-03-10 2018-09-13 Xerox Corporation Query expansion learning with recurrent networks
US9959272B1 (en) * 2017-07-21 2018-05-01 Memsource a.s. Automatic classification and translation of written segments

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN106547885A (zh) * 2016-10-27 2017-03-29 桂林电子科技大学 一种文本分类系统及方法
CN106980683A (zh) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 基于深度学习的博客文本摘要生成方法
CN107169035A (zh) * 2017-04-19 2017-09-15 华南理工大学 一种混合长短期记忆网络和卷积神经网络的文本分类方法
CN107229684A (zh) * 2017-05-11 2017-10-03 合肥美的智能科技有限公司 语句分类方法、系统、电子设备、冰箱及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052687A (zh) * 2020-09-02 2020-12-08 厦门市美亚柏科信息股份有限公司 基于深度可分离卷积的语义特征处理方法、装置及介质
CN112052687B (zh) * 2020-09-02 2023-11-21 厦门市美亚柏科信息股份有限公司 基于深度可分离卷积的语义特征处理方法、装置及介质
CN117521652A (zh) * 2024-01-05 2024-02-06 一站发展(北京)云计算科技有限公司 基于自然语言模型的智能匹配系统及方法
CN117521652B (zh) * 2024-01-05 2024-04-12 一站发展(北京)云计算科技有限公司 基于自然语言模型的智能匹配系统及方法

Also Published As

Publication number Publication date
US20200250379A1 (en) 2020-08-06
TW201917602A (zh) 2019-05-01
CN110019793A (zh) 2019-07-16
JP2021501390A (ja) 2021-01-14

Similar Documents

Publication Publication Date Title
WO2019080864A1 (zh) 一种文本语义编码方法及装置
US11755885B2 (en) Joint learning of local and global features for entity linking via neural networks
US11651236B2 (en) Method for question-and-answer service, question-and-answer service system and storage medium
CN108334487B (zh) 缺失语意信息补全方法、装置、计算机设备和存储介质
CN107273503B (zh) 用于生成同语言平行文本的方法和装置
CN107066464B (zh) 语义自然语言向量空间
US10650311B2 (en) Suggesting resources using context hashing
CN108205699B (zh) 生成用于神经网络输出层的输出
US10606946B2 (en) Learning word embedding using morphological knowledge
CN112860866B (zh) 语义检索方法、装置、设备以及存储介质
WO2015160544A1 (en) Context-sensitive search using a deep learning model
CN109858045B (zh) 机器翻译方法和装置
CN114861889B (zh) 深度学习模型的训练方法、目标对象检测方法和装置
CN111611452B (zh) 搜索文本的歧义识别方法、系统、设备及存储介质
US11651015B2 (en) Method and apparatus for presenting information
CN111488742B (zh) 用于翻译的方法和装置
CN114385780B (zh) 程序接口信息推荐方法、装置、电子设备和可读介质
CN114912450B (zh) 信息生成方法与装置、训练方法、电子设备和存储介质
CN111368551A (zh) 一种确定事件主体的方法和装置
CN107766498B (zh) 用于生成信息的方法和装置
CN113268560A (zh) 用于文本匹配的方法和装置
US20230124572A1 (en) Translation of text depicted in images
CN110222144B (zh) 文本内容提取方法、装置、电子设备及存储介质
CN111241843B (zh) 基于复合神经网络的语义关系推断系统和方法
CN111368554B (zh) 语句处理方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020520227

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18871049

Country of ref document: EP

Kind code of ref document: A1