WO2017162134A1 - Electronic device and method for text processing - Google Patents

Electronic device and method for text processing

Info

Publication number
WO2017162134A1
WO2017162134A1 (PCT/CN2017/077473)
Authority
WO
WIPO (PCT)
Prior art keywords
text
vector
text vector
feature representation
correlation
Prior art date
Application number
PCT/CN2017/077473
Other languages
English (en)
French (fr)
Inventor
吴友政
祁均
Original Assignee
索尼公司 (Sony Corporation)
吴友政
祁均
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 索尼公司 (Sony Corporation), 吴友政, 祁均
Priority to US16/080,670 priority Critical patent/US10860798B2/en
Priority to EP17769411.4A priority patent/EP3435247A4/en
Priority to CN201780007352.1A priority patent/CN108475262A/zh
Publication of WO2017162134A1 publication Critical patent/WO2017162134A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the field of natural language processing, and more particularly to an electronic device and method for text processing that construct a multi-view word feature representation model based on the correlation between two or more word feature representation models, so as to achieve a deep, shared-view representation of the features of a text object and thereby facilitate subsequent natural language processing.
  • NLU Natural Language Understanding
  • the representation of "Beijing" is, for example, [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0...]
  • the dimensionality of such a discrete vector equals the size of the dictionary, so the dimensionality is usually high.
  • combined with statistical models such as Maximum Entropy, Support Vector Machines (SVM), and Conditional Random Fields (CRF), this simple discrete representation method has served well for the various mainstream tasks in the field of Natural Language Processing (NLP), such as Part-of-Speech Tagging, Slot Filling, and Named Entity Recognition.
  • SVM Support Vector Machine
  • CRF Conditional Random Field
  • word embedding simply means that discrete text (for example, a word, phrase, or sentence) is represented as a vector in a low-dimensional space.
  • taking words as an example, word vector representations obtained with word embedding techniques typically look like, for example:
  • the representation of "Beijing" is [0.01, -0.8, -0.5, 0.123, -0.142, ...]
  • in word embedding techniques, word vector dimensionalities of 50, 100, or 300 are common. Since word embedding takes the semantic relationships between texts into account, the vector representations of individual words are not completely independent but carry certain semantic associations; this not only greatly reduces the dimensionality of the word vector representation, and hence the computational complexity, but also makes such word vector representations more useful for tasks in natural language processing and spoken language understanding.
  • Word2vec and GloVe are several word embedding technologies that have been widely used in recent years. With the development of deep learning, word embedding technology has become an indispensable branch of natural language processing and oral understanding, and the technology has achieved certain success.
  • Word2Vec relies on the skip-gram or Continuous Bag of Words (CBOW) model to create word vectors and can thus capture long word contexts
  • GloVe is trained on the non-zero entries of a global word co-occurrence matrix, which requires traversing the entire corpus to collect statistics.
  • word feature representations trained on a news-report corpus and on a daily spoken-language corpus each emphasize different semantic associations between words and are therefore limited.
  • an object of the present disclosure is to provide an electronic device and method for text processing that, starting from multiple views, provide a deep shared-view feature representation of a text object based on the correlation between different text feature representations that represent the text object from different views, so as to optimize system performance when performing tasks such as natural language processing and spoken language understanding.
  • an electronic device for text processing including a processor configured to: determine a correlation between a first text vector and a second text vector, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text; and obtain, according to the correlation, a third text vector for representing the text, wherein the vector space in which the third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie.
  • the text corresponds to a word.
  • the text corresponds to one of: a phrase consisting of a plurality of words; and a sentence composed of a plurality of phrases.
  • the first text vector and the second text vector are based on the first word feature representation model and the second word feature representation model, respectively.
  • the first word feature representation model and the second word feature representation model are derived based on different word feature representation training mechanisms, respectively.
  • the word feature representation training mechanism includes at least one of the following: a Word2Vec mechanism, a GloVe mechanism, and a C&W mechanism.
  • the processor is further configured to: determine the correlation between the first text vector and the second text vector based on canonical correlation analysis, and adjust the parameters of the canonical correlation analysis with the goal of making the correlation satisfy a predetermined condition.
  • the processor is further configured to: process the first text vector and the second text vector using neural networks to obtain a variable of the first text vector and a variable of the second text vector, determine the correlation based on the variable of the first text vector and the variable of the second text vector, and adjust the parameters of the neural networks with the goal of making the correlation satisfy a predetermined condition.
  • the processor is further configured to: process the variable of the first text vector and the variable of the second text vector using auto-encoders to reconstruct the first text vector and the second text vector, and adjust the parameters of the auto-encoders and the neural networks with the further goal of making the errors between the reconstructed first and second text vectors and the original first and second text vectors satisfy a predetermined condition, so as to determine the correlation.
  • the processor is further configured to determine, for a plurality of texts, the correlations between the respective first text vectors and second text vectors and to obtain the corresponding third text vectors
  • the electronic device further includes a memory configured to store the third text vectors of the plurality of texts for establishing a multi-view text feature representation model.
  • the processor is further configured to determine, for each of the plurality of texts, the correlation between the first text vector and the second text vector of that text also based on the correlations concerning the other texts.
  • a method for text processing comprising: determining a correlation between a first text vector and a second text vector, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text; and obtaining, according to the correlation, a third text vector for representing the text, wherein the vector space in which the third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie.
  • an electronic device for text processing comprising: a memory configured to store a multi-view text feature representation model, wherein the multi-view text feature representation model is established using the method described above
  • the processor is configured to read the multi-view text feature representation model from the memory, and map the text object to be processed into a corresponding multi-dimensional real number vector based on the multi-view text feature representation model.
  • a method for text processing comprising: reading a multi-view text feature representation model from a memory, wherein the multi-view text feature representation model is established using the method described above; and mapping a text object to be processed to a corresponding multi-dimensional real-number vector based on the multi-view text feature representation model.
  • the deficiencies of single-view text feature representation models in the prior art can be overcome, thereby improving performance when the model is applied to natural language processing.
  • FIG. 1 is a block diagram showing a functional configuration example of an electronic device for text processing according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram showing an implementation of determining the correlation between text vectors based on Canonical Correlation Analysis (CCA) according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram showing an implementation of further applying a neural network to determine the correlation between text vectors for the scheme shown in FIG. 2;
  • FIG. 4 is a schematic diagram showing an implementation of further applying an automatic encoder to determine the correlation between text vectors for the scheme shown in FIG. 3;
  • FIG. 5 is a block diagram showing a functional configuration example of an electronic device for text processing according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart illustrating a process example of a method for text processing according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart illustrating an example of a procedure of a method for text processing according to an embodiment of the present disclosure
  • FIG. 8 is a block diagram showing an example structure of a personal computer which is an information processing device which can be employed in the embodiment of the present disclosure.
  • FIG. 1 is a block diagram showing a functional configuration example of an electronic device for text processing according to an embodiment of the present disclosure.
  • the electronic device 100 may include a correlation determining unit 102 and a text vector generating unit 104.
  • the correlation determining unit 102 and the text vector generating unit 104 here may be separate physical entities or logical entities, or may be implemented by the same physical entity (e.g., a central processing unit (CPU), an application-specific integrated circuit (ASIC), etc.).
  • CPU central processing unit
  • ASIC application-specific integrated circuit
  • the correlation determination unit 102 can be configured to determine a correlation between the first text vector and the second text vector, wherein the first text vector and the second text vector are multi-dimensional real-number vectors each generated based on the same text.
  • the text may be, for example, a word, a phrase consisting of a plurality of words, or a sentence consisting of a plurality of phrases.
  • the first text vector and the second text vector are respectively based on the first word feature representation model and the second word feature representation model, and the two word feature representation models are word features respectively established from different perspectives.
  • the two word feature representation models are respectively obtained based on different word feature representation training mechanisms.
  • the word feature representation training mechanisms here may include at least one of a Word2Vec mechanism, a GloVe mechanism, and a C&W mechanism; that is, two of these three training mechanisms may be selected as the training mechanisms for the first word feature representation model and the second word feature representation model, respectively. These mechanisms are all word embedding techniques commonly used in the prior art and will not be described in detail herein.
  • the first word feature representation model is obtained based on the Word2Vec mechanism
  • the second word feature representation model is obtained based on the GloVe mechanism. It can be understood that, as the technology develops and improves, other mainstream word feature representation training mechanisms may emerge, and those skilled in the art can obviously also fuse word feature representation models obtained based on two other mainstream word feature representation training mechanisms according to the concept of the present disclosure.
  • the two word feature representation models are obtained based on different training corpora, respectively.
  • the first word feature representation model is trained on a general corpus (e.g., a large-scale news corpus or web page text), while the second word feature representation model is trained on a user-specific corpus (e.g., an e-mail corpus, a spoken-language corpus, etc.).
  • the training mechanisms used to train the first word feature representation model and the second word feature representation model may be the same or different.
  • in one example, the above two word feature representation models are obtained by the person implementing the technical solution of the present disclosure through training (online or offline) with the corresponding training mechanism and corpus, for example through targeted training according to the specific language processing task at hand; in another example, they are obtained directly from outside, for example word feature representation models already trained by others and obtained from an academic research sharing platform are used as the word feature representation models to be fused.
  • the present disclosure mainly takes the fusion of two word feature representation models as an example; however, those skilled in the art can understand that, according to actual needs, more than two word feature representation models can also be fused based on the present disclosure. For example, the first and second word feature representation models may first be fused, and the third word feature representation model obtained by the fusion may then be further fused with a fourth word feature representation model according to the solution of the present disclosure;
  • alternatively, the first and second word feature representation models may be fused to obtain a third word feature representation model while the fourth and fifth word feature representation models are fused to obtain a sixth word feature representation model, and the third word feature representation model may then be fused with the sixth word feature representation model; this will not be elaborated further here.
  • the correlation determining unit 102 may be further configured to determine the correlation between the first text vector and the second text vector based on canonical correlation analysis (CCA), and to adjust the parameters of the canonical correlation analysis with the goal of making the correlation satisfy a predetermined condition.
  • Canonical Correlation Analysis (CCA) is a commonly used statistical analysis method for analyzing the correlation between two sets of variables; here it is applied to determine the correlation between two sets of word feature representations (i.e., word vectors) in word embedding techniques.
  • CCA Canonical Correlation Analysis
  • other correlation analysis methods, including existing methods or methods that may appear in the future, may obviously also be used to determine the correlation between the first text vector and the second text vector.
  • CCA is a standard technique of unsupervised data analysis for finding the pair of linear projections under which the correlation of two random vectors is greatest.
  • mathematically, we define two random vectors (X_1, X_2), whose covariance matrices are denoted (Σ_11, Σ_22) and whose cross-covariance matrix is denoted Σ_12.
  • (r_1, r_2) > 0 are two regularization terms added to the covariance matrices Σ_11 and Σ_22 to ensure the non-singularity of the sample covariances.
  • CCA tries to find the pair of linear projections (A_1, A_2) of the two views with the greatest correlation, as shown in the following expression (1): (A_1*, A_2*) = argmax_{A_1, A_2} tr(A_1^T Σ_12 A_2), subject to A_1^T (Σ_11 + r_1 I) A_1 = A_2^T (Σ_22 + r_2 I) A_2 = I.
  • expression (1) is a classical semi-definite program. Suppose the intermediate term is T = (Σ_11 + r_1 I)^{-1/2} Σ_12 (Σ_22 + r_2 I)^{-1/2}, and let U_k and V_k be the first k left singular vectors and the first k right singular vectors of T; then the optimal solution is given by the following expression (2): (A_1*, A_2*) = ((Σ_11 + r_1 I)^{-1/2} U_k, (Σ_22 + r_2 I)^{-1/2} V_k).
  • FIG. 2 is a schematic diagram showing an implementation of determining the correlation between text vectors based on canonical correlation analysis, in accordance with an embodiment of the present disclosure.
  • X and Y are the first text vector and the second text vector, respectively, and U and V are linear transformation parameters of the typical correlation analysis, respectively.
  • the parameters U and V are adjusted here, for example, with the optimization goal of making the correlation between the linearly transformed first text vector (U^T X) and second text vector (V^T Y) as high as possible; mathematically, this can be expressed, for example, as determining the values of the parameters U and V with the optimization goal of minimizing the negative covariance between U^T X and V^T Y, where (·)^T denotes the transpose of a matrix.
  • although how to adjust the parameters U and V of the canonical correlation analysis is described here taking the case where the correlation between U^T X and V^T Y is maximized as an example, the present disclosure is not limited thereto; the parameters of the canonical correlation analysis may also be determined, depending on the actual situation (e.g., computing power), with the goal of a correlation that satisfies other predetermined conditions (e.g., a predetermined correlation threshold, a predetermined number of iterations, etc.), and this also applies to the description in the subsequent embodiments.
  • the specific process of determining the linear transformation parameters U and V according to the optimization objective function can be realized by those skilled in the art according to relevant mathematical knowledge, and will not be described in detail herein.
  • the text vector generation unit 104 can be configured to obtain a third text vector representing the same text based on the determined correlation.
  • from the U and V determined when the correlation satisfies the predetermined condition, two text feature representations of the same text, namely U^T X and V^T Y, can be obtained, either of which can serve as the third text vector
  • the third text vector can be represented, for example, as U^T X, V^T Y, or a vector determined based on at least one of U^T X and V^T Y (e.g., a transformed form such as a weighted average of the two).
  • the vector space in which the generated third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie.
  • the generated third text vector takes into account the correlation between the first and second text vectors obtained from different views, so it is a multi-view, deep feature representation of the same text that can improve the performance of subsequent natural language processing.
  • the technical solution of the present disclosure can fuse at least two existing text feature representation models and is therefore easy to implement and popularize; it is not necessary, for example, to re-combine two corpora for training.
  • the above-described correlation determining unit 102 may be further configured to determine, for the plurality of texts, the correlations between the corresponding first text vectors and second text vectors, and the text vector generating unit 104 may be further configured to obtain the corresponding third text vectors from the correlations determined for the individual texts.
  • the electronic device 100 can further include a memory 106 configured to store the third text vectors of the plurality of texts for establishing a multi-view text feature representation model; the multi-view text feature representation model represents a mapping from text objects to text vectors determined based on multiple views and can be used for various tasks in subsequent natural language processing.
  • the correlation determining unit 102 may be further configured to determine, for each of the plurality of texts, the correlation between the first text vector and the second text vector of that text also based on the correlations concerning the other texts.
  • for each text, the corresponding third text vector may be determined based only on the correlation between that text's own first and second text vectors, and a new multi-view text feature representation model may then be established from the set of third text vectors determined in this way.
  • alternatively, the parameters of the canonical correlation analysis for a set of texts may be determined based on the overall correlation between the set of first text vectors and the set of second text vectors for that set of texts, i.e., when determining the correlation for a particular text, the correlations of the other texts also need to be taken into account; the set of third text vectors determined in this way for the set of texts is then used to establish the multi-view text feature representation model.
  • for the specific implementation of using canonical correlation analysis to determine the correlation between the first and second text vector sets with the set of texts taken as a whole, reference may be made to the principles of canonical correlation analysis, which are not described in detail here.
  • in the example implementations described below, the correlation is determined with the set of texts taken as a whole, but it may alternatively be determined on a text-by-text basis, so that the corresponding set of third text vectors is determined according to the correlation and used to establish the multi-view text feature representation model; those skilled in the art may choose a specific implementation according to the actual situation, and the present disclosure places no limitation on this.
  • the neural network can be further utilized to determine the above correlation.
  • FIG. 3 is a schematic diagram showing an implementation that further applies neural networks to the scheme shown in FIG. 2 to determine the correlation between text vectors.
  • two independent Deep Neural Networks (DNNs) are further added to apply nonlinear transformations to the two input text vectors X and Y (here X and Y can also represent sets of text vectors), and Canonical Correlation Analysis (CCA) is then used to determine the correlation between the nonlinearly transformed vectors.
  • CCA Canonical Correlation Analysis
  • this scheme can also be referred to hereinafter as Deep Canonical Correlation Analysis (DCCA).
  • DCCA Deep Canonical Correlation Analysis
  • although the combination of deep neural networks and canonical correlation analysis is taken here as an example for determining the correlation between the text vectors, as described above, the combination of deep neural networks and other correlation analysis techniques may also be used for this determination.
  • the use of two independent deep neural networks for the nonlinear transformations is intended to reduce the computational complexity; if computational complexity is not a concern, a single deep neural network may of course also be used to nonlinearly transform the first and second text vectors.
  • in the example shown in FIG. 3, the meanings of the symbols X, Y, U, and V are the same as those described above with reference to FIG. 2 and are not repeated here; f(·) and g(·) respectively represent the nonlinear transformations of the two deep neural networks, whose parameters are W_f and W_g, respectively.
  • the first text vector X and the second text vector Y first pass through the deep neural networks to undergo the nonlinear transformations, and the variable of the transformed first text vector and the variable of the transformed second text vector are denoted f(X) and g(Y), respectively.
  • f(X) and g(Y) are then each linearly transformed by CCA, and the parameters of the canonical correlation analysis (i.e., U and V) and the parameters of the deep neural networks are adjusted with the goal of maximizing the correlation between the linearly transformed f(X) and g(Y) (i.e., between U^T f(X) and V^T g(Y)).
  • the parameters of the deep neural networks may include the above W_f and W_g, and may also include their structural parameters (including the number of layers of the deep neural network and the dimensionality of each layer), so that the final third text vector can be determined as U^T f(X), V^T g(Y), or a vector determined based on at least one of U^T f(X) and V^T g(Y) (e.g., a transformed form such as a weighted average of the two).
  • the structural parameters of the deep neural network may also be predefined according to factors such as the computing system environment. According to one example of the present invention, the predefined structure has 4 layers whose dimensionalities are 100, 1024, 1024, and 100, respectively.
  • N represents the total number of text vector sets
  • I represents the identity matrix
  • (r x , r y ) > 0 is a regularization parameter for covariance estimation.
  • from this training, the deep neural network parameters W_f and W_g and the CCA parameters U and V can be determined.
  • then, for each text in turn, its first text vector X and second text vector Y are respectively input into the corresponding neural networks f(·) and g(·) and then transformed by CCA to obtain the target third text vector of that text, such as U^T f(X), until the text vector transformation of all texts to be processed is completed.
  • the local or global is relative to all the text collections to be processed, and those skilled in the art can extract the relevant texts from the entire text collection as local training data according to the specific language processing tasks they face.
  • global or local training data can also be selected based on its requirements for model accuracy and computing resources.
  • target optimization function is merely an example and not a limitation, and those skilled in the art can also design an objective function suitable for actual needs based on the principle of the present disclosure according to a specific optimization goal.
  • CCA Canonical Correlation Analysis
  • DCCA Deep Canonical Correlation Analysis
  • the first text vector and the second text vector may further be reconstructed using auto-encoders, and the relevant parameters adjusted so as to minimize the auto-encoding error while maximizing the correlation, thereby determining the corresponding third text vector.
  • FIG. 4 is a schematic diagram showing an implementation of further applying an automatic encoder to determine the correlation between text vectors for the scheme shown in FIG.
  • two auto-encoders are further added to reconstruct the first text vector and the second text vector from their nonlinear transformations produced by the deep neural networks; this scheme may be referred to hereinafter as Deep Canonically Correlated Auto-Encoders (DCCAE).
  • DCCAE Deep Canonically Correlated Auto-Encoders
  • techniques other than CCA can also be applied to determine the correlation.
  • the meanings of the symbols X, Y, U, V, f(·), and g(·) are the same as those described above with reference to FIG. 3 and are not repeated here; the symbols p(·) and q(·) denote the nonlinear transformations of the auto-encoders (i.e., deep neural networks) used for reconstruction, whose parameters are W_p and W_q, respectively.
  • the variable f(X) of the first text vector and the variable g(Y) of the second text vector obtained after the nonlinear transformations of the deep neural networks are input simultaneously to the CCA module and to the auto-encoder module.
  • the parameters of the canonical correlation analysis (i.e., U and V), the parameters of the deep neural networks (i.e., W_f and W_g), and the parameters of the auto-encoders (i.e., W_p and W_q) are adjusted with the optimization goals of maximizing the correlation between U^T f(X) and V^T g(Y) while minimizing the auto-encoding errors, i.e., the errors between the reconstructed first text vector p(f(X)) and second text vector q(g(Y)) and the original first text vector X and second text vector Y, respectively, so that the final third text vector can be determined as U^T f(X), V^T g(Y), or a vector determined based on at least one of U^T f(X) and V^T g(Y).
  • the above calculation process can be expressed mathematically, for example, as finding the U, V, W_f, W_g, W_p, and W_q that minimize the sum of the negative covariance between U^T f(X) and V^T g(Y) and the reconstruction errors between p(f(X)) and X and between q(g(Y)) and Y, which can be written, for example, as the following expression (4): min -(1/N) tr(U^T f(X) g(Y)^T V) + (λ/N) Σ_i ( ||x_i - p(f(x_i))||² + ||y_i - q(g(y_i))||² ), subject to the same constraints on U^T f(X) and V^T g(Y) as in expression (3).
  • λ is a normalization constant used to control the weight of the auto-encoder term (i.e., the proportion of the auto-coding error in the objective function); it is an empirical value or a value determined through a limited number of experiments.
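A minimal, self-contained PyTorch sketch of the idea behind expression (4) is given below purely for illustration; the network shapes, the decoder structures p(·) and q(·), the value of lam, and the random placeholder data are all assumptions, not part of the disclosure, and the canonical correlations are computed in closed form inside the loss as one common way of implementing such objectives.

```python
import torch
import torch.nn as nn

def neg_total_correlation(Hx, Hy, rx=1e-4, ry=1e-4):
    """Negative sum of canonical correlations between Hx and Hy (each of shape N x d)."""
    N = Hx.shape[0]
    Hx, Hy = Hx - Hx.mean(0, keepdim=True), Hy - Hy.mean(0, keepdim=True)
    S11 = Hx.T @ Hx / N + rx * torch.eye(Hx.shape[1])
    S22 = Hy.T @ Hy / N + ry * torch.eye(Hy.shape[1])
    S12 = Hx.T @ Hy / N
    def inv_sqrt(S):
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag(w.clamp_min(1e-12).rsqrt()) @ V.T
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return -torch.linalg.svdvals(T).sum()

# Hypothetical encoders f, g and decoders p, q (structures invented for illustration).
f = nn.Sequential(nn.Linear(100, 256), nn.Sigmoid(), nn.Linear(256, 100))
g = nn.Sequential(nn.Linear(100, 256), nn.Sigmoid(), nn.Linear(256, 100))
p = nn.Sequential(nn.Linear(100, 256), nn.Sigmoid(), nn.Linear(256, 100))
q = nn.Sequential(nn.Linear(100, 256), nn.Sigmoid(), nn.Linear(256, 100))

X = torch.randn(512, 100)   # placeholder first-view text vectors
Y = torch.randn(512, 100)   # placeholder second-view text vectors
lam = 0.1                   # weight of the auto-encoding error (empirical)

fx, gy = f(X), g(Y)
recon_err = ((p(fx) - X) ** 2).sum(1).mean() + ((q(gy) - Y) ** 2).sum(1).mean()
loss = neg_total_correlation(fx, gy) + lam * recon_err   # spirit of expression (4)
loss.backward()
```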
  • the objective function is merely an example and not a limitation, and one skilled in the art can modify the objective function according to actual design goals.
  • the determined deep neural network parameters W_f and W_g and the CCA parameters U and V can then be obtained.
  • for each text, its first text vector X and second text vector Y are respectively input into the corresponding neural networks f(·) and g(·) and then transformed by CCA to obtain the target third text vector of that text, such as U^T f(X), until the text vector transformation of all texts to be processed is completed.
  • the optimization objective may also not be to maximize the correlation but, for example, a preset maximum number of iterations or a correlation satisfying a predetermined threshold, or a correlation analysis technique other than CCA may be employed; all such variants are considered to fall within the scope of the present disclosure.
  • FIG. 5 is a block diagram illustrating a functional configuration example of an electronic device for text processing according to an embodiment of the present disclosure.
  • the electronic device 500 may include a memory 502 and a processor 504.
  • the memory 502 can be configured to store the multi-view text feature representation model established above.
  • the processor 504 can be configured to read the multi-view text feature representation model from the memory 502 and map the text object to be processed to a corresponding multi-dimensional based on the multi-view text feature representation model Real number vector.
  • the text object to be processed may be stored in the memory 502 or an external memory, or may be input by a user, such as a user inputting a voice, and the voice recognition module converts the voice into text, and is processed by the solution of the present disclosure.
  • the text object may be, for example, a word
  • the multi-view text feature representation model is, for example, a word feature representation model.
  • the processor 504 can appropriately divide the phrase, sentence, or paragraph into a plurality of word units using an existing word division technique, and, based on the word feature representation model, map the plurality of word units to corresponding word vectors for performing natural language understanding processing such as element extraction, sentence classification, and automatic translation.
  • the established multi-view text feature representation model is, for example, a feature representation model of a text object such as a phrase or a sentence
  • in that case, a sentence can be mapped directly, or a paragraph can be divided into sentences (or a sentence into phrases), the resulting text objects mapped to corresponding text vectors based on the multi-view text feature representation model, and the phrases, sentences, or paragraphs understood based on these text vectors. That is, in actual processing, a word (or sentence) division step may also be required; this may employ techniques well known in the prior art, is not related to the inventive point of the present invention, and is therefore not described in detail here.
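As a small, self-contained illustration of this lookup-and-map step, the sketch below averages the multi-view vectors of the words of an already segmented sentence into a sentence vector; the words, vectors, dimensionality, and the averaging itself are illustrative assumptions rather than requirements of the disclosure.

```python
import numpy as np

# Hypothetical multi-view word vectors (dimensionality and values invented).
model = {
    "flight":  np.array([0.12, -0.30, 0.05]),
    "boston":  np.array([0.40, 0.11, -0.22]),
    "seattle": np.array([0.38, 0.09, -0.25]),
}

def sentence_vector(words, model, dim=3):
    """Average the vectors of the in-vocabulary words of an already segmented sentence."""
    vecs = [model[w] for w in words if w in model]   # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector(["flight", "from", "boston", "to", "seattle"], model))
```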
  • the present disclosure also provides the following method embodiments.
  • a process example of a method for text processing according to an embodiment of the present disclosure will be described with reference to FIGS. 6 and 7.
  • FIG. 6 is a flowchart illustrating a process example of a method for text processing according to an embodiment of the present disclosure. The method corresponds to an embodiment of an electronic device for text processing described above with reference to FIG.
  • in step S610, a correlation between the first text vector and the second text vector is determined, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text.
  • step S620 a third text vector is obtained according to the determined correlation for representing the text, and the vector space in which the third text vector is located is related to the vector space in which the first text vector and the second text vector are located.
  • the text corresponds to a word, a phrase consisting of a plurality of words, or a sentence consisting of a plurality of phrases.
  • the first text vector and the second text vector are respectively based on the first word feature representation model and the second word feature representation model, and the first word feature representation model and the second word feature representation model are respectively based on different word feature representation training Mechanisms and/or different training corpora.
  • the word feature representation training mechanism may include at least one of the following: a Word2Vec mechanism, a GloVe mechanism, and a C&W mechanism, that is, two of the three training mechanisms may be selected as the first feature representation model and the second word feature representation, respectively. The training mechanism of the model.
  • the method further comprises determining the correlation between the first text vector and the second text vector based on canonical correlation analysis, and adjusting the parameters of the canonical correlation analysis with the goal of making the correlation satisfy a predetermined condition.
  • the method further comprises: respectively determining a correlation between the corresponding first text vector and the second text vector for the plurality of texts and obtaining a corresponding third text vector; and establishing a third text vector based on the plurality of texts Multi-view text feature representation model.
  • the method may further comprise determining a correlation between text vectors based on the above-described schemes such as DCCA and DCCAE.
  • FIG. 7 is a flowchart illustrating a process example of a method for text processing according to an embodiment of the present disclosure. This method corresponds to an embodiment of an electronic device for text processing described above with reference to FIG.
  • step S710 the above-described established multi-view text feature representation model is read from the memory.
  • step S720 the text object to be processed is mapped to the corresponding multi-dimensional real number vector based on the multi-view text feature representation model.
  • the text object to be processed may be stored in an internal memory or an external memory, or may be input by a user.
  • the text object may correspond to a word
  • the method may further comprise textual understanding of at least one of a phrase, a sentence, and a paragraph containing the text object based on the multi-dimensional real number vector of the text object.
  • when the multi-view text feature representation model established according to an embodiment of the present disclosure is applied to perform tasks in natural language understanding, it can effectively optimize processing performance.
  • the following is a comparison of processing performance between the text feature representation models constructed according to the prior art and the multi-view text feature representation models established by the CCA, DCCA, and DCCAE schemes according to the present invention, respectively, when applied to the element extraction (slot filling) task in spoken language understanding.
  • the present invention can also be applied to any other task in natural language understanding, such as the above-described part-of-speech tagging, named entity recognition, and the like. That is, the electronic device 500 of the present disclosure may in practice further include a higher-level natural language processing module, such as an element extraction module, a part-of-speech tagging module, or a named entity recognition module; based on the multi-dimensional real-number vector obtained by mapping the text to be processed with the multi-view text feature representation model, this higher-level language processing module further performs the corresponding natural language understanding.
  • the element extraction (slot filling) task specifically consists of extracting the elements in an input sentence and labeling them.
  • the data set used is the Air Travel Information System (ATIS) data set
  • a specific element extraction task is, for example, performed on an input sentence such as "today's flight from Boston to Seattle", whose elements are extracted and labeled once the task has been executed.
  • RNN Recurrent Neural Network
  • the RNN models compared include the Elman-type RNN and the Jordan-type RNN
  • the word embedding techniques involved in the experimental comparison include: a random initialization method, Word2Vec, GloVe, and the CCA, DCCA, and DCCAE schemes based on Word2Vec and GloVe.
  • the metric used here to measure performance on the element extraction task is the F1 measure, which is the harmonic mean of precision and recall.
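For reference, the F1 measure combines precision P and recall R as their harmonic mean, F1 = 2·P·R / (P + R); a one-line helper makes the definition concrete.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, the F1 measure used in the comparison."""
    return 2 * precision * recall / (precision + recall)
```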
  • Table 2 below shows the experimental comparison results:
  • the multi-view text feature representation model established according to the techniques of the present disclosure can achieve better performance.
  • the multi-view text feature representation model established in accordance with the techniques of the present disclosure may also achieve better performance in other natural language understanding tasks.
  • the machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure can also be executed to perform the text processing described above; portions not described in detail here may therefore refer to the corresponding earlier description, which is not repeated here.
  • a storage medium for carrying a program product storing the above-described machine-readable instruction code and a storage medium for carrying the multi-view text feature representation model of the present disclosure are also included in the disclosure of the present invention.
  • the storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
  • the above series of processing and the devices can also be implemented by software and/or firmware.
  • in that case, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware structure, such as the general-purpose personal computer 800 shown in FIG. 8, which, when various programs are installed, is able to perform various functions and the like.
  • a central processing unit (CPU) 801 executes various processes in accordance with a program stored in a read only memory (ROM) 802 or a program loaded from a storage portion 808 to a random access memory (RAM) 803.
  • ROM read only memory
  • RAM random access memory
  • data required when the CPU 801 executes various processes and the like is also stored as needed.
  • the CPU 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804.
  • Input/output interface 805 is also coupled to bus 804.
  • the following components are connected to the input/output interface 805: an input portion 806 including a keyboard, a mouse, etc.; an output portion 807 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker and the like;
  • the storage portion 808 includes a hard disk or the like; and the communication portion 809 includes a network interface card such as a LAN card, a modem, and the like.
  • the communication section 809 performs communication processing via a network such as the Internet.
  • the driver 810 is also connected to the input/output interface 805 as needed.
  • a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage portion 808 as needed.
  • a program constituting the software is installed from a network such as the Internet or a storage medium such as the removable medium 811.
  • such a storage medium is not limited to the removable medium 811 shown in FIG. 8 in which a program is stored and distributed separately from the device to provide a program to the user.
  • the detachable medium 811 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disk read only memory (CD-ROM) and a digital versatile disk (DVD)), and a magneto-optical disk (including a mini disk (MD) (registered trademark) )) and semiconductor memory.
  • the storage medium may be a ROM 802, a hard disk included in the storage portion 808, or the like, in which programs are stored, and distributed to the user together with the device containing them.
  • a plurality of functions included in one unit in the above embodiments may be implemented by separate devices.
  • alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may be implemented by separate devices, respectively.
  • in addition, one of the above functions may be implemented by a plurality of units. Needless to say, such configurations are included in the technical scope of the present disclosure.
  • the steps described in the flowcharts include not only processes performed in time series in the stated order, but also processes performed in parallel or individually rather than necessarily in time series. Further, even in the step of processing in time series, it is needless to say that the order can be appropriately changed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An electronic device and method for text processing. The electronic device includes a processor (100) configured to: determine a correlation between a first text vector and a second text vector, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text; and obtain, according to the correlation, a third text vector for representing the text, wherein the vector space in which the third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie. The electronic device and method can establish a text feature representation model that combines multiple views for text feature representation, and can thereby improve the performance of natural language processing.

Description

Electronic device and method for text processing
This application claims priority to Chinese Patent Application No. 201610166105.3, entitled "Electronic device and method for text processing", filed with the Chinese Patent Office on March 22, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of natural language processing, and more particularly to an electronic device and method for text processing that construct a multi-view word feature representation model based on the correlation between two or more word feature representation models, so as to achieve a deep, shared-view representation of the features of a text object and thereby better support subsequent natural language processing.
Background
In traditional natural language understanding (NLU) algorithms, a piece of text (for example, a word) is treated as a discrete symbol; the representations of words are independent and discrete, so that there is little association between words. For example:
the representation of "中国" (China) is [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...]
the representation of "北京" (Beijing) is [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...]
The dimensionality of such a discrete vector equals the size of the dictionary and is therefore usually high. Combined with statistical models such as Maximum Entropy, Support Vector Machines (SVM), and Conditional Random Fields (CRF), this simple discrete representation has served well for the various mainstream tasks in the field of Natural Language Processing (NLP), for example Part-of-Speech Tagging, Slot Filling, and Named Entity Recognition.
However, such a discrete representation usually also means that more training data is needed to successfully train the statistical model, so the computational load is large; moreover, such independent word representations often fail to reflect the semantic associations between words, which can be disadvantageous for natural language understanding.
Word embedding techniques developed in recent years overcome these drawbacks. Put simply, word embedding represents discrete text (for example, a word, phrase, or sentence) as a vector in a low-dimensional space. Taking words as an example, word vector representations obtained with word embedding techniques typically look like:
the representation of "中国" (China) is [0.0172, -0.77, -0.507, 0.1, -0.42, ...]
the representation of "北京" (Beijing) is [0.01, -0.8, -0.5, 0.123, -0.142, ...]
In word embedding techniques, word vector dimensionalities of 50, 100, or 300 are common. Because word embedding takes the semantic relationships between texts into account, the vector representations of individual words are not completely independent but carry certain semantic associations. This not only greatly reduces the dimensionality of the word vector representation, and hence the computational complexity, but also makes such word vector representations more useful for tasks in natural language processing and spoken language understanding.
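As a small illustration of the two kinds of representation discussed above (the vocabulary and the numeric values below are invented for illustration and are not taken from any actual training):

```python
import numpy as np

# Toy dictionary: a discrete (one-hot) representation uses one dimension per word.
vocab = ["中国", "北京", "天气", "航班"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical low-dimensional word embeddings (values invented for illustration).
embedding = {
    "中国": np.array([0.0172, -0.77, -0.507, 0.1, -0.42]),
    "北京": np.array([0.01, -0.8, -0.5, 0.123, -0.142]),
}

def cosine(a, b):
    # Cosine similarity, a simple measure of how related two vectors are.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors of different words are always orthogonal (similarity 0),
# while the dense embeddings of semantically related words are close.
print(cosine(one_hot["中国"], one_hot["北京"]))      # 0.0
print(cosine(embedding["中国"], embedding["北京"]))  # close to 1
```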
C&W, Word2vec, and GloVe are several word embedding techniques that have been widely used in recent years. With the development of deep learning, word embedding has become an indispensable branch of natural language processing and spoken language understanding, and the technique has achieved considerable success.
However, existing word embedding techniques produce word feature representations from only a single view (for example, using a single training mechanism or a single training corpus). Such word feature representations are usually limited: they have pronounced advantages in one respect while falling short in others. For example, Word2Vec relies on the skip-gram or Continuous Bag of Words (CBOW) model to create word vectors and can thus capture long word contexts, whereas GloVe is trained on the non-zero entries of a global word co-occurrence matrix, which requires traversing the entire corpus to collect statistics. As another example, word feature representations trained on a news-report corpus and on a daily spoken-language corpus each emphasize different semantic associations between words and are therefore limited.
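To make the multi-view setting concrete, the following sketch loads two independently trained embedding tables for the same vocabulary from plain text files in the common "word value value ..." format; the file names and dimensionalities are assumptions for illustration only.

```python
import numpy as np

def load_embeddings(path):
    """Read a text file in which each line is: word v1 v2 ... vd."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

# Hypothetical files: one view trained with Word2Vec on news text,
# the other trained with GloVe on a spoken-language corpus.
view1 = load_embeddings("word2vec_news_100d.txt")
view2 = load_embeddings("glove_spoken_100d.txt")

# Only words present in both views can be paired for the correlation analysis.
shared_vocab = sorted(set(view1) & set(view2))
X = np.stack([view1[w] for w in shared_vocab], axis=1)  # first view,  shape (d1, N)
Y = np.stack([view2[w] for w in shared_vocab], axis=1)  # second view, shape (d2, N)
```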
Summary
A brief summary of the present disclosure is given below to provide a basic understanding of certain aspects of the disclosure. It should be understood, however, that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical parts of the disclosure, nor to limit its scope. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description given later.
In view of the above problems, an object of the present disclosure is to provide an electronic device and method for text processing that start from multiple views and, based on the correlation between different text feature representations that represent a text object from different views, provide a deep shared-view feature representation of the text object, so as to optimize system performance when performing tasks such as natural language processing and spoken language understanding.
According to one aspect of the present disclosure, there is provided an electronic device for text processing. The electronic device includes a processor configured to: determine a correlation between a first text vector and a second text vector, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text; and obtain, according to the correlation, a third text vector for representing the text, wherein the vector space in which the third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie.
According to a preferred embodiment of the present disclosure, the text corresponds to a word.
According to another preferred embodiment of the present disclosure, the text corresponds to one of: a phrase composed of a plurality of words; and a sentence composed of a plurality of phrases.
According to another preferred embodiment of the present disclosure, the first text vector and the second text vector are based on a first word feature representation model and a second word feature representation model, respectively.
According to another preferred embodiment of the present disclosure, the first word feature representation model and the second word feature representation model are obtained based on different word feature representation training mechanisms, respectively.
According to another preferred embodiment of the present disclosure, the word feature representation training mechanisms include at least one of: a Word2Vec mechanism, a GloVe mechanism, and a C&W mechanism.
According to another preferred embodiment of the present disclosure, the processor is further configured to: determine the correlation between the first text vector and the second text vector based on canonical correlation analysis, and adjust the parameters of the canonical correlation analysis with the goal of making the correlation satisfy a predetermined condition.
According to another preferred embodiment of the present disclosure, the processor is further configured to: process the first text vector and the second text vector using neural networks to obtain a variable of the first text vector and a variable of the second text vector, determine the correlation based on the variable of the first text vector and the variable of the second text vector, and adjust the parameters of the neural networks with the goal of making the correlation satisfy a predetermined condition.
According to another preferred embodiment of the present disclosure, the processor is further configured to: process the variable of the first text vector and the variable of the second text vector using auto-encoders to reconstruct the first text vector and the second text vector, and adjust the parameters of the auto-encoders and the neural networks with the further goal of making the errors between the reconstructed first and second text vectors and the original first and second text vectors satisfy a predetermined condition, so as to determine the correlation.
According to another preferred embodiment of the present disclosure, the processor is further configured to determine, for a plurality of texts, the correlations between the respective first text vectors and second text vectors and to obtain the corresponding third text vectors, and the electronic device further includes a memory configured to store the third text vectors of the plurality of texts for establishing a multi-view text feature representation model.
According to another preferred embodiment of the present disclosure, the processor is further configured to determine, for each of the plurality of texts, the correlation between the first text vector and the second text vector of that text also based on the correlations concerning the other texts.
According to another aspect of the present disclosure, there is also provided a method for text processing, the method including: determining a correlation between a first text vector and a second text vector, the first text vector and the second text vector being multi-dimensional real-number vectors each generated based on the same text; and obtaining, according to the correlation, a third text vector for representing the text, wherein the vector space in which the third text vector lies is correlated with the vector spaces in which the first text vector and the second text vector lie.
According to another aspect of the present disclosure, there is also provided an electronic device for text processing, the electronic device including: a memory configured to store a multi-view text feature representation model, wherein the multi-view text feature representation model is established using the above method; and a processor configured to read the multi-view text feature representation model from the memory and to map a text object to be processed to a corresponding multi-dimensional real-number vector based on the multi-view text feature representation model.
According to another aspect of the present disclosure, there is also provided a method for text processing, the method including: reading a multi-view text feature representation model from a memory, wherein the multi-view text feature representation model is established using the above method; and mapping a text object to be processed to a corresponding multi-dimensional real-number vector based on the multi-view text feature representation model.
According to other aspects of the present disclosure, there are also provided computer program code and a computer program product for implementing the above methods according to the present disclosure, as well as a computer-readable storage medium on which the computer program code for implementing the above methods is recorded. In addition, a computer-readable storage medium for carrying the multi-view text feature representation model of the present disclosure is also provided.
According to the embodiments of the present disclosure, a multi-view text feature representation model is established by representing text features in a manner that combines multiple views, which can overcome the shortcomings of single-view text feature representation models in the prior art and can thereby improve performance when applied to natural language processing.
Other aspects of the embodiments of the present disclosure are given in the following description, in which the detailed description serves to fully disclose preferred embodiments of the present disclosure without limiting it.
Brief Description of the Drawings
The present disclosure can be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference signs are used throughout to denote the same or similar components. The drawings, together with the following detailed description, are included in and form part of this specification and serve to further illustrate preferred embodiments of the present disclosure and to explain the principles and advantages of the present disclosure. In the drawings:
FIG. 1 is a block diagram showing an example functional configuration of an electronic device for text processing according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram showing an implementation of determining the correlation between text vectors based on Canonical Correlation Analysis (CCA) according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing an implementation that further applies neural networks to the scheme shown in FIG. 2 to determine the correlation between text vectors;
FIG. 4 is a schematic diagram showing an implementation that further applies auto-encoders to the scheme shown in FIG. 3 to determine the correlation between text vectors;
FIG. 5 is a block diagram showing an example functional configuration of an electronic device for text processing according to an embodiment of the present disclosure;
FIG. 6 is a flowchart showing an example procedure of a method for text processing according to an embodiment of the present disclosure;
FIG. 7 is a flowchart showing an example procedure of a method for text processing according to an embodiment of the present disclosure; and
FIG. 8 is a block diagram showing an example structure of a personal computer usable as an information processing device in embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system- and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be understood that although such development work may be very complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present disclosure with unnecessary detail, only the device structures and/or processing steps closely related to the solutions according to the present disclosure are shown in the drawings, while other details of little relevance to the present disclosure are omitted.
Next, embodiments of the present disclosure will be described in detail with reference to FIG. 1 to FIG. 8.
First, an example functional configuration of an electronic device for text processing according to an embodiment of the present disclosure will be described with reference to FIG. 1. FIG. 1 is a block diagram showing an example functional configuration of an electronic device for text processing according to an embodiment of the present disclosure.
As shown in FIG. 1, the electronic device 100 according to this embodiment may include a correlation determining unit 102 and a text vector generating unit 104. It should be noted that the correlation determining unit 102 and the text vector generating unit 104 here may be separate physical entities or logical entities, or may be implemented by the same physical entity (for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), or the like).
The correlation determining unit 102 may be configured to determine a correlation between a first text vector and a second text vector, where the first text vector and the second text vector are multi-dimensional real-number vectors each generated based on the same text. The text may be, for example, a word, a phrase composed of a plurality of words, or a sentence composed of a plurality of phrases.
Taking the case where the text is a word as an example, the first text vector and the second text vector are based on a first word feature representation model and a second word feature representation model, respectively; these two word feature representation models are word feature representation models established from different views. For example, the two word feature representation models are obtained based on different word feature representation training mechanisms. Preferably, the word feature representation training mechanisms here may include at least one of a Word2Vec mechanism, a GloVe mechanism, and a C&W mechanism; that is, two of these three training mechanisms may be selected as the training mechanisms for the first word feature representation model and the second word feature representation model, respectively. These mechanisms are all word embedding techniques commonly used in the prior art and are not described in detail here. As an example, the first word feature representation model is obtained based on the Word2Vec mechanism, and the second word feature representation model is obtained based on the GloVe mechanism. It can be understood that, as the technology develops and improves, other mainstream word feature representation training mechanisms may emerge, and those skilled in the art can obviously also fuse word feature representation models obtained based on two other mainstream word feature representation training mechanisms according to the concept of the present disclosure.
On the other hand, alternatively, the two word feature representation models are obtained based on different training corpora. For example, the first word feature representation model is obtained based on a general corpus (for example, a large-scale news corpus or web page text), while the second word feature representation model is trained on a user-specific corpus (for example, an e-mail corpus, a spoken-language corpus, or the like); the training mechanisms used to train the first and second word feature representation models may be the same or different.
It should be noted that, in one example, the above two word feature representation models are obtained by the person implementing the technical solution of the present disclosure through training (online or offline) with the corresponding training mechanisms and corpora, for example through targeted training according to the specific language processing task at hand; in another example, they are obtained directly from outside, for example word feature representation models already trained by others and obtained from an academic research sharing platform are used as the word feature representation models to be fused. In addition, the present disclosure mainly takes the fusion of two word feature representation models as an example; however, those skilled in the art can understand that, according to actual needs, more than two word feature representation models can also be fused based on the present disclosure. For example, the first and second word feature representation models may first be fused according to the solution of the present disclosure, and the third word feature representation model obtained by the fusion may then be fused with a fourth word feature representation model according to the solution of the present disclosure; alternatively, the first and second word feature representation models may be fused to obtain a third word feature representation model while the fourth and fifth word feature representation models are fused to obtain a sixth word feature representation model, and the third word feature representation model may then be fused with the sixth word feature representation model. This will not be elaborated further here.
Preferably, the correlation determining unit 102 may be further configured to determine the correlation between the first text vector and the second text vector based on canonical correlation analysis (CCA), and to adjust the parameters of the canonical correlation analysis with the goal of making the correlation satisfy a predetermined condition. Canonical correlation analysis (CCA) is a commonly used statistical analysis method for analyzing the correlation between two sets of variables; here it is applied to determine the correlation between two sets of word feature representations (that is, word vectors) in word embedding techniques. It should be understood, however, that those skilled in the art can obviously also conceive of using other correlation analysis methods (including existing methods or methods that may appear in the future) to determine the correlation between the first text vector and the second text vector.
CCA is briefly introduced here. CCA is a standard technique of unsupervised data analysis for finding the pair of linear projections under which the correlation of two random vectors is greatest. Mathematically, we define two random vectors (X_1, X_2), whose covariance matrices are denoted (Σ_11, Σ_22) and whose cross-covariance matrix is denoted Σ_12. (r_1, r_2) > 0 are two regularization terms added to the covariance matrices Σ_11 and Σ_22 to ensure the non-singularity of the sample covariances. CCA tries to find the pair of linear projections (A_1, A_2) of the two views with the greatest correlation, as shown in the following expression (1):
(A_1*, A_2*) = argmax_{A_1, A_2} tr(A_1^T Σ_12 A_2)  subject to  A_1^T (Σ_11 + r_1 I) A_1 = A_2^T (Σ_22 + r_2 I) A_2 = I      (1)
Expression (1) is a classical semi-definite programming problem. Suppose the intermediate term is
T = (Σ_11 + r_1 I)^{-1/2} Σ_12 (Σ_22 + r_2 I)^{-1/2}
and let U_k and V_k be the first k left singular vectors and the first k right singular vectors of T; then the optimal solution is given by the following expression (2):
(A_1*, A_2*) = ((Σ_11 + r_1 I)^{-1/2} U_k, (Σ_22 + r_2 I)^{-1/2} V_k)      (2)
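For illustration, the closed-form solution of expression (2) can be written as a short NumPy sketch; the function name, the regularization values, and the assumed layout of the views as d x N matrices are illustrative choices, not part of the disclosure.

```python
import numpy as np

def linear_cca(X1, X2, k, r1=1e-4, r2=1e-4):
    """Top-k CCA projections A1, A2 for views X1 (d1 x N) and X2 (d2 x N)."""
    N = X1.shape[1]
    X1 = X1 - X1.mean(axis=1, keepdims=True)          # center each view
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    S11 = X1 @ X1.T / N + r1 * np.eye(X1.shape[0])    # regularized covariance of view 1
    S22 = X2 @ X2.T / N + r2 * np.eye(X2.shape[0])    # regularized covariance of view 2
    S12 = X1 @ X2.T / N                                # cross-covariance

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)
    T = S11_is @ S12 @ S22_is                          # intermediate term of expression (2)
    U, s, Vt = np.linalg.svd(T)
    A1 = S11_is @ U[:, :k]                             # optimal projection for view 1
    A2 = S22_is @ Vt[:k, :].T                          # optimal projection for view 2
    return A1, A2, s[:k]                               # s[:k]: canonical correlations
```

With the matrices X and Y assembled in the earlier loading sketch, A1.T @ X and A2.T @ Y then play the roles of U^T X and V^T Y in FIG. 2.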
在实施例的以下描述中,将以典型相关分析为例来描述本公开的技术,但是应理解,本公开并不限于此。
图2是示出根据本公开的实施例的基于典型相关分析来确定文本向量间的相关性的实现方案的示意图。
如图2所示,假设X和Y分别为第一文本向量和第二文本向量,并且U和V分别为典型相关分析的线性变换参数。根据典型相关分析,这里例如以使得经线性变换后的第一文本向量(UTX)与第二文本向量(VTY)之间的相关性最高为优化目标来调整参数U和V,即,在数学上可以表示为例如以使得UTX与VTY之间的协方差最小为优化目标来确定参数U和V的值,其中(·)T表示矩阵的转置。这里应理解,尽管这里以使得UTX与VTY之间的相关性最高为例来描述如何调整典型相关分析的参数U和V,但是本公开不限于此,而是也可根据实际情况(例如,计算能力等)而以满足其它预定条件(例如,预定相关性阈值、预定迭代次数等)的相关性为目标来确定典型相关分析的参 数,这同样适用于随后的实施例中的描述。根据优化目标函数来确定线性变换参数U和V的具体过程是本领域技术人员根据相关数学知识可以实现的,在此不再详细描述。
返回参照图1,文本向量生成单元104可被配置成根据所确定的相关性来获得第三文本向量以表示同一文本。
在图2所示的示例中,根据在相关性满足预定条件时所确定的U和V,可以获得同一文本的两个文本特征表示,即UTX、VTY,其中任一者皆可作为第三文本向量,换言之,第三文本向量例如可以表示为UTX、VTY或者基于UTX和VTY中至少之一确定的向量(例如两者的加权平均等变换形式)。
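接续上面的示意代码，获得某一文本的第三文本向量的一种可能做法如下（等权平均仅为示例取法，并非对第三文本向量构成方式的限定）：

```python
# x、y为同一文本(例如同一个词)在两个模型下的第一、第二文本向量
x, y = w2v[vocab[0]], glove[vocab[0]]
hx, hy = A1.T @ x, A2.T @ y            # 两个视角经线性变换后的k维表示，对应UTX与VTY
third_vector = 0.5 * (hx + hy)         # 第三文本向量的一种取法：等权平均(也可直接取hx或hy之一)
```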
可以理解,如上所述,由于参数U和V是基于第一文本向量与第二文本向量之间的相关性所确定的,因此所生成的第三文本向量所在的向量空间与第一文本向量和第二文本向量所在的向量空间具有相关性。这样,所生成的第三文本向量考虑了基于不同视角得到的第一文本向量和第二文本向量之间的相关性,因此其是对同一文本的多视角、深度特征表示,能够提高后续的自然语言处理的性能。另外,本公开的技术方案可以是针对既得的至少两个文本特征表示模型进行融合从而易于实现和推广,不必再例如重新统合两种语料进行训练。
以上描述了对一个文本进行处理以得到新的第三文本向量的示例。类似地，可对多个文本进行类似处理以得到相应的第三文本向量的集合，用于建立多视角文本特征表示模型。
优选地,上述相关性确定单元102可进一步被配置成针对多个文本,通过上述方式分别确定相应的第一文本向量与第二文本向量之间的相关性,并且文本向量生成单元104可进一步被配置成根据关于各个文本所确定的相关性来获得相应的第三文本向量。
优选地,该电子设备100还可包括存储器106,存储器106可被配置为存储这多个文本的第三文本向量以用于建立多视角文本特征表示模型,该多视角文本特征表示模型表示从文本对象到基于多个视角确定的文本向量的映射,可用于执行后续的自然语言处理中的各种任务。
此外，优选地，相关性确定单元102可进一步被配置成针对多个文本中的每个文本，还基于关于其它文本的相关性来确定该文本的第一文本向量与第二文本向量之间的相关性。根据上述方式，针对每个文本，可仅基于该文本自身的第一文本向量与第二文本向量之间的相关性来确定其对应的第三文本向量，然后根据这些分别确定的第三文本向量的集合来建立新的多视角文本特征表示模型。然而，通常地，当对特定文本集合进行处理以建立新的文本特征表示模型时，取代逐文本地确定第三文本向量，还可基于针对该文本集合的第一文本向量集合与第二文本向量集合之间的整体相关性来确定针对该文本集合的典型相关分析的参数，即，在确定关于特定文本的相关性时还需要将其它文本的相关性纳入考虑，由此来确定针对该文本集合的第三文本向量集合，从而用于建立多视角文本特征表示模型。
利用典型相关分析技术、以文本集合作为整体来确定第一文本向量集合与第二文本向量集合之间的相关性的具体实现过程可参见典型相关分析技术的原理,在此不再详细描述。此外,应指出,在以下参照图3和图4描述的确定相关性的示例实现方案中,均是以多个文本的集合作为整体来确定相关性,但是替选地也可逐文本地来确定相关性,从而根据相关性来确定相应的第三文本向量集合以用于建立多视角文本特征表示模型,本领域技术人员可根据实际情况而选择具体的实现方式,本公开对此不作限制。
优选地,还可进一步利用神经网络来确定上述相关性。图3是示出对图2所示的方案进一步应用神经网络来确定文本向量间的相关性的实现方案的示意图。
如图3所示，在图2所示的方案的基础上，进一步添加了两个独立的深度神经网络(Deep Neural Network,DNN)以对所输入的两个文本向量X和Y(这里的X和Y也可表示文本向量集合)进行非线性变换，然后再利用典型相关分析(CCA)来确定非线性变换后的向量之间的相关性，该方案在下文中也可以称为深度典型相关分析(Deep Canonical Correlation Analysis,DCCA)。然而，应理解，尽管这里以深度神经网络与典型相关分析的组合为例来确定文本向量间的相关性，但是如上所述，也可利用深度神经网络与其它相关性分析技术的组合来执行该确定。此外，这里利用两个独立的深度神经网络来进行非线性变换是为了降低计算复杂度，在不考虑计算复杂度的情况下，当然也可利用一个深度神经网络来对第一和第二文本向量进行非线性变换。
在图3所示的示例中,符号X、Y、U和V的含义与以上参照图2描述的相同,在此不再重复,f(·)和g(·)分别表示两个深度神经网络的非线性变换,其参数分别为Wf和Wg。根据图3所示的方案,第一文本向量X和第二文本向量Y首先经过深度神经网络以接受非线性变换,并且变换后的第一文本向量的变量和第二文本向量的变量分别记为f(X)和g(Y)。然后,利用CCA对f(X)和g(Y)分别进行线性变换,并且以使得线性变换后的f(X)和g(Y)(即,UTf(X)和VTg(Y))之间的相关性最大化为目标来调整典型相关分析的参数(即,U和V)和深度神经网络的参数,深度神经网络的参数可包括上述Wf和Wg,另外还可以包括其结构参数(包括深度神经网络的层数和每层上的维度),从而可以确定最终的第三文本向量为UTf(X)、VTg(Y)或者基于UTf(X)和VTg(Y)中至少之一确定的向量(例如两者的加权平均等变换形式)。其中,深度神经网络的结构参数也可以是根据运算系统环境等因素预定义的,根据本发明的一个示例预定结构为4层,每层的维度分别为100、1024、1024和100。
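作为图3中两个独立深度神经网络f(·)与g(·)的一个示意性实现，下面给出基于PyTorch的简化代码（层数与各层维度沿用上文示例的预定义结构100、1024、1024、100，激活函数等细节为假设）：

```python
import torch
import torch.nn as nn

class ViewNet(nn.Module):
    """对单一视角的文本向量进行非线性变换的深度神经网络(即f(·)或g(·))。"""
    def __init__(self, dims=(100, 1024, 1024, 100)):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        self.net = nn.Sequential(*layers[:-1])   # 最后一层不加激活(假设)

    def forward(self, x):
        return self.net(x)

f_net = ViewNet()   # 处理第一文本向量X的网络f(·)，其参数对应Wf
g_net = ViewNet()   # 处理第二文本向量Y的网络g(·)，其参数对应Wg
```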
上述计算过程在数学上可以表示为寻找使得UTf(X)与VTg(Y)之间的相关性最大(等价地，使其负协方差最小)的U、V、Wf和Wg，例如可以表示为如下表达式(3)：

(U*, V*, Wf*, Wg*) = argmin −(1/N)·tr(U^T f(X) g(Y)^T V)，
约束条件为：U^T((1/N)·f(X)f(X)^T + rx·I)U = I，V^T((1/N)·g(Y)g(Y)^T + ry·I)V = I      (3)
其中,N表示文本向量集合的总数,I表示单位矩阵,并且(rx,ry)>0是用于协方差估计的正则化参数。
如何根据上述目标优化函数对模型进行训练以确定深度神经网络的参数Wf和Wg以及CCA的线性变换参数U和V是本领域技术人员根据掌握的数学知识可以实现的，这并不是本公开的技术的重点，因此在此不再详细描述。例如，可以使用受限玻尔兹曼机(Restricted Boltzmann Machine,RBM)技术来进行模型的预训练，然后使用反向传播(Back-propagation)结合例如随机梯度下降(Stochastic Gradient Descent,SGD)的技术，基于使得文本向量间的相关性最大的目标函数对深度神经网络的参数Wf和Wg以及CCA的线性变换参数U和V进行联合优化学习。根据一个示例，利用上述的随机梯度下降方案对DNN的参数进行精细调整，例如先确定DNN顶层的梯度(delta)，再根据梯度调整DNN顶层参数如W’f=Wf+调整系数*梯度，进而再推算DNN其他层次的参数。其中，DNN顶层的梯度可以通过基于目标函数(即表达式(3))分别针对Hx和Hy对corr(Hx,Hy)进行求导来获得，其中，corr(Hx,Hy)表示Hx和Hy的相关度，Hx=UTf(X)以及Hy=VTg(Y)。
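作为上述corr(Hx,Hy)(即目标函数中相关性一项)计算方式的一个示意，下面给出可直接用于自动求导的简化实现（正则化参数rx、ry的取值为假设；以负相关性作为待最小化的损失）：

```python
def neg_correlation(Hx, Hy, rx=1e-4, ry=1e-4):
    """返回两组表示Hx、Hy(形状均为(N,k))之间总典型相关性的负值，作为损失最小化。"""
    N = Hx.size(0)
    Hx, Hy = Hx - Hx.mean(0), Hy - Hy.mean(0)
    Sxx = Hx.t() @ Hx / (N - 1) + rx * torch.eye(Hx.size(1))
    Syy = Hy.t() @ Hy / (N - 1) + ry * torch.eye(Hy.size(1))
    Sxy = Hx.t() @ Hy / (N - 1)

    def inv_sqrt(S):
        w, v = torch.linalg.eigh(S)
        return v @ torch.diag(w.clamp_min(1e-12) ** -0.5) @ v.t()

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    corr = torch.linalg.svdvals(T).sum()   # 奇异值之和即总相关性
    return -corr                           # 最大化相关性等价于最小化其负值
```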
在利用局部或优选的全局训练数据基于DCCA方案对深度神经网络的参数以及CCA的线性变换参数完成训练之后,可以得到确定的深度神经网络的参数Wf和Wg及CCA参数U和V。届时,逐个地针对每个文本,将其第一文本向量X和第二文本向量Y分别输入相应的神经网络f(·)和g(·)中,之后再由CCA进行变换即可获得该文本的目标第三文本向量例如UTf(X),直至完成所有待处理文本的文本向量变换。其中,局部或全局是相对于待处理的全部文本集合而言的,本领域技术人员可以根据其面对的具体语言处理任务,从全部文本集合中抽取出有关的文本作为局部训练数据进行训练以提高效率,也可以根据其对模型精度的要求和运算资源来选择全局或局部训练数据。
此外,应指出,以上给出的目标优化函数仅为示例而非限制,并且本领域技术人员也可根据具体的优化目标,基于本公开的原理而设计适合实际需要的目标函数。
应指出,上述典型相关分析(CCA)和深度典型相关分析(DCCA)均属于无监督学习过程,因此,在确定相关性的过程中,尽管可能获得了第一文本向量与第二文本向量之间的较高相关性,但是在此过程中,可能会使得此时获得的第三文本向量与第一文本向量和/或第二文本向量的差别较大,即,导致较大的失真,这有可能在一定程度上影响后续的自然语言处理的性能。鉴于此,为了进一步优化系统性能,还可通过利用自动编码器重构第一文本向量和第二文本向量,以在最大化相关性的同时最小化自动编码错误来调整相关参数,从而确定相应的第三文本向量。以下将参照图4描述该情况下的实现方案。
图4是示出对图3所示的方案进一步应用自动编码器来确定文本向量间的相关性的实现方案的示意图。
如图4所示,在图3所示的方案的基础上,进一步加入了两个自动编码器(auto-encoder)以对经过深度神经网络非线性变换后的第一文本向量和第二文本向量进行重构,该方案在下文可称为深度典型相关自动编码(Deep Canonically Correlated Auto-Encoders,DCCAE)。类似地,如上所述,在该实现方案中,也可应用除CCA之外的技术来确定相关性。
在图4所示的示例中,符号X、Y、U、V、f(·)和g(·)的含义与以上参照图3描述的相同,在此不再重复,符号p(·)和q(·)分别表示用于重构的自动编码器(即,深度神经网络)的非线性变换,其参数分别为Wp和Wq
根据图4所示的方案，经过深度神经网络的非线性变换后的第一文本向量的变量f(X)和第二文本向量的变量g(Y)同时被输入到CCA模块和自动编码器模块以分别接受相关性分析和重构，并且以在使得线性变换后的f(X)和g(Y)(即，UTf(X)和VTg(Y))之间的相关性最大化的同时使得自动编码误差(即，重构后的第一文本向量p(f(X))和第二文本向量q(g(Y))分别与原始的第一文本向量X和第二文本向量Y之间的差的绝对值|p(f(X))-X|和|q(g(Y))-Y|)最小化为优化目标来调整典型相关分析的参数(即，U和V)、深度神经网络的参数(即，Wf和Wg)以及自动编码器的参数(即，Wp和Wq)，从而可以确定最终的第三文本向量为UTf(X)、VTg(Y)或者基于UTf(X)和VTg(Y)中至少之一确定的向量。
上述计算过程在数学上可以例如表示为寻找使得UTf(X)与VTg(Y)之间的负协方差、p(f(X))与X之间的差的绝对值以及q(g(Y))与Y之间的差的绝对值三者之和最小的U、V、Wf、Wg、Wp和Wq，这例如可以表示为如下表达式(4)：

(U*, V*, Wf*, Wg*, Wp*, Wq*) = argmin −(1/N)·tr(U^T f(X) g(Y)^T V) + λ·(|p(f(X))−X| + |q(g(Y))−Y|)，
约束条件与表达式(3)相同      (4)
在表达式(4)中，与以上表达式(3)中的符号相同的符号表示相同的含义，在此不再重复。λ是用于控制自动编码器所占权重的归一化常数(实际上是控制自动编码误差在目标函数中所占的比例)，其为经验值或者通过有限次实验确定的值。
如何根据该目标表达式来对相关参数进行联合优化学习可参见以上针对DCCA方案的描述,在此不再重复。此外,应理解,该目标函数仅是示例而非限制,并且本领域技术人员可以根据实际的设计目标而对该目标函数进行修改。在利用局部或全局训练数据基于DCCAE方案对深度神经网络的参数以及CCA的线性变换参数完成训练之后,可以得到确定的深度神经网络的参数Wf和Wg及CCA参数U和V。届时,逐个地针对每个文本,将其第一文本向量X和第二文本向量Y分别输入相应的神经网络f(·)和g(·)中,之后再由CCA进行变换即可获得该文本的目标第三文本向量例如UTf(X),直至完成所有待处理文本的文本向量变换。
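作为表达式(4)中优化目标构成方式的一个草图，下面接续前文的示意代码给出DCCAE一次迭代中损失的计算方式（重构网络p(·)、q(·)同样以多层感知机示意，λ的取值为假设；此处以均方误差近似表示重构误差项）：

```python
p_net = ViewNet(dims=(100, 1024, 100))   # 用于重构X的自动编码器部分p(·)，参数对应Wp
q_net = ViewNet(dims=(100, 1024, 100))   # 用于重构Y的自动编码器部分q(·)，参数对应Wq
lam = 0.01                               # 控制自动编码误差所占比例的常数λ(假设取值)

def dccae_loss(X_batch, Y_batch):
    fx, gy = f_net(X_batch), g_net(Y_batch)        # 非线性变换后的f(X)与g(Y)
    corr_loss = neg_correlation(fx, gy)            # 相关性一项(取负以便最小化)
    recon_loss = ((p_net(fx) - X_batch) ** 2).mean() + ((q_net(gy) - Y_batch) ** 2).mean()
    return corr_loss + lam * recon_loss            # 在最大化相关性的同时最小化重构误差
```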
应指出,尽管以上参照图2至图4描述了确定文本向量间的相关性的示例实现方案,但是应理解,这仅是示例而非限制,并且本领域技术人员可根据本公开的原理而对上述实现方案进行修改。例如,优化目标函数可以不是使得相关性最大化,而是预设的最大迭代次数或者满足预定阈值的相关性等,或者也可采用除CCA之外的相关性分析技术等等,并且这样的变型均认为落入本公开的范围内。
通过利用上述CCA、DCCA、DCCAE方案来获得文本向量间的相关性,由于结合了多个视角来表示文本特征,因此能够获得深度多视角文本特征表示模型,从而能够提高自然语言理解等中的任务的性能。
接下来,将参照图5描述利用上述获得的多视角文本特征表示模型来进行文本处理的实施例。图5是示出根据本公开的实施例的用于文本处理的电子设备的功能配置示例的框图。
如图5所示,根据该实施例的电子设备500可包括存储器502和处理器504。
存储器502可被配置为存储上述所建立的多视角文本特征表示模型。
处理器504可被配置为从存储器502读取多视角文本特征表示模型，并且基于该多视角文本特征表示模型而将待处理的文本对象映射为相应的多维实数向量。该待处理的文本对象可存储在存储器502或者外部存储器中，或者也可以是用户输入的，例如用户输入语音，由语音识别模块将语音转化为文本，进而由本公开的方案进行处理。
该文本对象例如可以是词语,并且该多视角文本特征表示模型例如是词特征表示模型。在该情况下,当对短语、句子或段落进行处理时,处理器504可通过利用现有的词划分技术将该短语、句子或段落适当地划分为多个词语单元,并基于该词特征表示模型而将这多个词语单元分别映射为相应的词向量,以用于执行要素抽取、语句分类、自动翻译等自然语言理解处理。
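作为上述将待处理文本划分为词语单元并映射为多维实数向量这一过程的简化示意（其中多视角模型的存储形式与分词接口均为假设，接续前文的示意代码）：

```python
# 假设多视角文本特征表示模型以"词语 -> 多维实数向量"的映射形式保存在存储器中
multiview_model = {w: 0.5 * (A1.T @ w2v[w] + A2.T @ glove[w]) for w in vocab}

def embed_text(text, tokenizer):
    """将待处理的短语/句子划分为词语单元，并映射为相应的多维实数向量序列。"""
    tokens = tokenizer(text)                       # 例如借助现有的中文分词工具(此处为假设接口)
    return [multiview_model[t] for t in tokens if t in multiview_model]
```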
替选地,在所建立的多视角文本特征表示模型例如是短语或句子等文本对象的特征表示模型时,取代将短语、句子、段落等划分为相应的词单元,可通过直接映射、将句子或段落划分为短语或者将段落划分为句子等方式,基于多视角文本特征表示模型将这些文本对象映射为相应的文本向量,并且基于这些文本向量对这些短语、句子或段落进行理解。即,在实际处理过程中,可能还需要进行词语划分的处理,该处理可采用现有技术中公知的技术,并且与本发明的发明点不相关,因此在此不详细描述。
利用所建立的文本特征表示模型来进行自然语言理解等处理的具体过程与现有技术中相同,在此不再详细描述。
在这里,应指出,尽管以上参照图1和图5描述了用于文本处理的电子设备的功能配置示例,但是这仅是示例而非限制,并且本领域技术人员可根据本公开的原理而对上述功能配置进行修改。例如,所示出的各个功能单元可以进行组合、进一步划分或者添加另外的功能单元,并且这样的变型应认为落入本公开的范围内。
与上述装置实施例相对应的,本公开还提供了以下方法实施例。接下来,将参照图6和图7描述根据本公开的实施例的用于文本处理的方法的过程示例。
图6是示出根据本公开的实施例的用于文本处理的方法的过程示例的流程图。该方法对应于以上参照图1描述的用于文本处理的电子设备的实施例。
如图6所示，首先，在步骤S610中，确定第一文本向量与第二文本向量之间的相关性，第一文本向量和第二文本向量是分别基于同一文本生成的多维实数向量。
接下来,在步骤S620中,根据所确定的相关性获得第三文本向量以用于表示该文本,第三文本向量所在的向量空间与第一文本向量和第二文本向量所在的向量空间相关。
优选地,该文本对应于词语、由多个词语构成的短语或者由多个短语构成的句子。
优选地，第一文本向量和第二文本向量分别基于第一词特征表示模型和第二词特征表示模型，第一词特征表示模型和第二词特征表示模型是分别基于不同的词特征表示训练机制以及/或者不同的训练语料得到的。词特征表示训练机制可包括以下至少之一：Word2Vec机制、GloVe机制和C&W机制，即，可从这三种训练机制中选择两种分别作为用于第一词特征表示模型和第二词特征表示模型的训练机制。
优选地,该方法还包括:基于典型相关分析确定第一文本向量与第二文本向量之间的相关性,并且以使得相关性满足预定条件为目标来调整典型相关分析的参数。
优选地,该方法还包括:针对多个文本分别确定相应的第一文本向量与第二文本向量之间的相关性并获得相应的第三文本向量;以及基于多个文本的第三文本向量建立多视角文本特征表示模型。
此外,优选地,该方法还可包括基于上述DCCA和DCCAE等方案来确定文本向量间的相关性。
利用CCA、DCCA和DCCAE等方案来确定文本向量间的相关性以生成相应的第三文本向量从而建立多视角文本特征表示模型的具体处理过程可参见以上装置实施例中相应位置的描述,在此不再重复。
图7是示出根据本公开的实施例的用于文本处理的方法的过程示例的流程图。该方法对应于以上参照图5描述的用于文本处理的电子设备的实施例。
如图7所示,首先,在步骤S710中,从存储器读取上述建立的多视角文本特征表示模型。接下来,在步骤S720中,基于该多视角文本特征表示模型而将待处理的文本对象映射为相应的多维实数向量。该待处理的文本对象可存储在内部存储器或者外部存储器中,或者也可以是用户输入的。
优选地,该文本对象可对应于词语,并且该方法还可包括基于该文本对象的多维实数向量而对包含该文本对象的短语、句子和段落中至少之一进行文本理解。
应理解,图6和图7所示的方法实施例的流程图仅是示例而非限制,并且本领域技术人员可根据本公开的原理而对上述处理步骤进行修改,例如,对上述处理步骤进行添加、删除、组合和/或变更等,并且这样的变型都应认为落入本公开的范围内。
此外,还应指出,这里参照图6和图7描述的方法实施例分别与以上参照图1和图5描述的装置实施例相对应,因此在此未详细描述的内容可参见以上装置实施例中相应位置的描述,而在此不再重复。
当将根据本公开的实施例所建立的多视角文本特征表示模型应用于执行自然语言理解中的任务时，其能够有效地优化处理性能。下面将作为示例给出将根据现有技术构建的文本特征表示模型以及根据本发明的CCA、DCCA和DCCAE方案分别建立的多视角文本特征表示模型应用于口语理解中的要素抽取任务时，各个模型之间的处理性能对比。
应理解，尽管这里给出了要素抽取任务作为示例来检验本发明的实际效果，但是本发明还可以应用于自然语言理解中的任何其它任务，诸如上述词性标注、命名实体识别等任务。也就是说，本公开的例如电子设备500实际上还可以包括要素提取模块、词性标注模块或命名实体识别模块等高层的自然语言处理模块，响应于基于多视角文本特征表示模型对待处理的文本映射得到的多维实数向量，上述高层语言处理模块进一步执行相应的自然语言理解。其中，要素抽取的任务具体来说就是抽取输入句子中的要素并且进行标记。例如，在该对比实验中，作为示例，采用的数据集合是航空交通信息系统(Air Travel Information System,ATIS)，并且要素抽取的具体任务是对输入句子“今天从波士顿到西雅图的航班”进行要素抽取，执行要素抽取后的结果如以下表1所示：
表1要素抽取的结果
输入的句子：今天 ｜ 从 ｜ 波士顿 ｜ 到 ｜ 西雅图 ｜ 的 ｜ 航班
输出的要素标注结果：B-日期 ｜ 0 ｜ B-出发地 ｜ 0 ｜ B-到达地 ｜ 0 ｜ 0
其中,今天是日期的起始词(B-日期),波士顿是出发地的起始词(B-出发地),西雅图是到达地的起始词(B-到达地),并且“0”表示非要素词。应指出,根据该示例,本公开的方案可应用于例如航旅订票系统、日程安排系统等产品中。当然,由于本公开的方案涉及基础的词嵌入技术,还可以广泛地被应用于多种其他语言理解场景。
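为便于说明，上述标注结果的数据形式可以如下示意（其中标签"0"沿用上文的约定，表示非要素词）：

```python
tokens = ["今天", "从", "波士顿", "到", "西雅图", "的", "航班"]
labels = ["B-日期", "0", "B-出发地", "0", "B-到达地", "0", "0"]   # "0"表示非要素词
slots = list(zip(tokens, labels))    # 逐词的(词语, 要素标签)对
```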
近年来的研究表明,循环神经网络(Recurrent Neural Network,RNN)在要素抽取任务中能够获得更好的性能,因此,在本实验中,分别采用两种类型的RNN(即,埃尔曼型RNN和乔丹型RNN)来验证本发明的效果,并且参与实验对比的词嵌入技术包括:随机法、Word2Vec、GloVe、基于Word2Vec和GloVe的CCA方案、DCCA方案以及DCCAE方案。
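作为实验中所用循环神经网络要素抽取模型的一个极简示意（以埃尔曼型RNN为例，隐藏层维度、标签数等均为假设，并非实验实际采用的配置），其以多视角词向量序列为输入、逐词输出要素标签得分：

```python
class SlotTagger(nn.Module):
    """以词向量序列为输入、逐词输出要素标签得分的埃尔曼型RNN标注器(示意)。"""
    def __init__(self, emb_dim=100, hidden_dim=128, num_labels=7):
        super().__init__()
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)   # 埃尔曼型循环层
        self.out = nn.Linear(hidden_dim, num_labels)                # 逐词的标签分类层

    def forward(self, emb_seq):      # emb_seq形状: (batch, seq_len, emb_dim)
        h, _ = self.rnn(emb_seq)
        return self.out(h)           # 每个词位置上各要素标签的得分
```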
这里用于衡量要素抽取任务中的性能的指标定义为F1测度,其表示准确率和召回率的调和平均值。以下表2示出了实验对比结果:
表2性能比较结果
(表2的具体数值以附图图像形式给出，其中针对埃尔曼型RNN和乔丹型RNN两种网络，分别列出了随机法、Word2Vec、GloVe以及基于Word2Vec和GloVe的CCA方案、DCCA方案和DCCAE方案的F1测度。)
从上表可以看出,无论是哪种类型的循环神经网络,根据本公开的技术所建立的多视角文本特征表示模型均能够实现更优的性能。
此外,尽管这里未具体描述,但是根据本公开的技术所建立的多视角文本特征表示模型在其它自然语言理解任务中同样可以实现更优的性能。
应理解，根据本公开的实施例的存储介质和程序产品中的机器可执行的指令在被执行时还可以实现以上描述的用于文本处理的方法，因此在此未详细描述的部分可参考先前相应位置的描述，在此不再重复进行描述。
相应地,用于承载上述存储有机器可读取的指令代码的程序产品的存储介质以及用于承载本公开的多视角文本特征表示模型的存储介质也包括在本发明的公开中。所述存储介质包括但不限于软盘、光盘、磁光盘、存储卡、存储棒等等。
另外，还应该指出的是，上述系列处理和装置也可以通过软件和/或固件实现。在通过软件和/或固件实现的情况下，从存储介质或网络向具有专用硬件结构的计算机，例如图8所示的通用个人计算机800安装构成该软件的程序，该计算机在安装有各种程序时，能够执行各种功能等等。
在图8中,中央处理单元(CPU)801根据只读存储器(ROM)802中存储的程序或从存储部分808加载到随机存取存储器(RAM)803的程序执行各种处理。在RAM 803中,也根据需要存储当CPU 801执行各种处理等等时所需的数据。
CPU 801、ROM 802和RAM 803经由总线804彼此连接。输入/输出接口805也连接到总线804。
下述部件连接到输入/输出接口805:输入部分806,包括键盘、鼠标等等;输出部分807,包括显示器,比如阴极射线管(CRT)、液晶显示器(LCD)等等,和扬声器等等;存储部分808,包括硬盘等等;和通信部分809,包括网络接口卡比如LAN卡、调制解调器等等。通信部分809经由网络比如因特网执行通信处理。
根据需要,驱动器810也连接到输入/输出接口805。可拆卸介质811比如磁盘、光盘、磁光盘、半导体存储器等等根据需要被安装在驱动器810上,使得从中读出的计算机程序根据需要被安装到存储部分808中。
在通过软件实现上述系列处理的情况下,从网络比如因特网或存储介质比如可拆卸介质811安装构成软件的程序。
本领域的技术人员应当理解,这种存储介质不局限于图8所示的其中存储有程序、与设备相分离地分发以向用户提供程序的可拆卸介质811。可拆卸介质811的例子包含磁盘(包含软盘(注册商标))、光盘(包含光盘只读存储器(CD-ROM)和数字通用盘(DVD))、磁光盘(包含迷你盘(MD)(注册商标))和半导体存储器。或者,存储介质可以是ROM 802、存储部分808中包含的硬盘等等,其中存有程序,并且与包含它们的设备一起被分发给用户。
还需要指出的是,执行上述系列处理的步骤可以自然地根据说明的顺序按时间顺序执行,但是并不需要一定根据时间顺序执行。某些步骤可以并行或彼此独立地执行。
例如，在以上实施例中包括在一个单元中的多个功能可以由分开的装置来实现。替选地，在以上实施例中由多个单元实现的多个功能可分别由分开的装置来实现。另外，以上功能之一可由多个单元来实现。无需说，这样的配置包括在本公开的技术范围内。
在该说明书中,流程图中所描述的步骤不仅包括以所述顺序按时间序列执行的处理,而且包括并行地或单独地而不是必须按时间序列执行的处理。此外,甚至在按时间序列处理的步骤中,无需说,也可以适当地改变该顺序。
虽然已经详细说明了本公开及其优点,但是应当理解在不脱离由所附的权利要求所限定的本公开的精神和范围的情况下可以进行各种改变、替代和变换。而且,本公开实施例的术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。

Claims (23)

  1. 一种用于文本处理的电子设备,所述电子设备包括处理器,所述处理器被配置为:
    确定第一文本向量与第二文本向量之间的相关性,所述第一文本向量和所述第二文本向量是分别基于同一文本生成的多维实数向量;以及
    根据所述相关性获得第三文本向量以用于表示所述文本,其中,所述第三文本向量所在的向量空间与所述第一文本向量和所述第二文本向量所在的向量空间相关。
  2. 根据权利要求1所述的电子设备,其中,所述文本对应于词语。
  3. 根据权利要求1所述的电子设备,其中,所述文本对应于以下之一:多个词语组成的短语;以及多个短语组成的句子。
  4. 根据权利要求2所述的电子设备,其中,所述第一文本向量和所述第二文本向量分别基于第一词特征表示模型和第二词特征表示模型。
  5. 根据权利要求4所述的电子设备,其中,所述第一词特征表示模型和所述第二词特征表示模型是分别基于不同的词特征表示训练机制得到的。
  6. 根据权利要求5所述的电子设备,其中,所述词特征表示训练机制包括以下至少之一:Word2Vec机制、GloVe机制和C&W机制。
  7. 根据权利要求4或5所述的电子设备,其中,所述第一词特征表示模型和所述第二词特征表示模型是分别基于不同的训练语料得到的。
  8. 根据权利要求1所述的电子设备,其中,所述处理器进一步被配置为:基于典型相关分析来确定所述第一文本向量与所述第二文本向量之间的相关性,并且以使得所述相关性满足预定条件为目标来调整所述典型相关分析的参数。
  9. 根据权利要求1或8所述的电子设备,其中,所述处理器进一步被配置为:利用神经网络对所述第一文本向量和所述第二文本向量进行处理以得到所述第一文本向量的变量和所述第二文本向量的变量,基于所述第一文本向量的变量和所述第二文本向量的变量确定所述相关性,并且以使得所述相关性满足预定条件为目标来调整所述神经网络的参数。
  10. 根据权利要求9所述的电子设备，其中，所述处理器进一步被配置为：利用自动编码器对所述第一文本向量的变量和所述第二文本向量的变量进行处理以重构所述第一文本向量和所述第二文本向量，并且以还使得重构后的第一文本向量和第二文本向量与所述第一文本向量和所述第二文本向量之间的误差满足预定条件为目标来调整所述自动编码器和所述神经网络的参数，以确定所述相关性。
  11. 根据权利要求1至10中任一项所述的电子设备,其中,所述处理器进一步被配置为针对多个文本分别确定相应的第一文本向量与第二文本向量之间的相关性并获得相应的第三文本向量,并且所述电子设备还包括存储器,所述存储器被配置为存储所述多个文本的第三文本向量以用于建立多视角文本特征表示模型。
  12. 根据权利要求11所述的电子设备,其中,所述处理器进一步被配置成针对所述多个文本中的每个文本,还基于关于其它文本的所述相关性来确定该文本的相应的第一文本向量与第二文本向量之间的相关性。
  13. 一种用于文本处理的方法,包括:
    确定第一文本向量与第二文本向量之间的相关性,所述第一文本向量和所述第二文本向量是分别基于同一文本生成的多维实数向量;以及
    根据所述相关性获得第三文本向量以用于表示所述文本,其中,所述第三文本向量所在的向量空间与所述第一文本向量和所述第二文本向量所在的向量空间相关。
  14. 根据权利要求13所述的方法,其中,所述文本对应于词语。
  15. 根据权利要求14所述的方法,其中,所述第一文本向量和所述第二文本向量分别基于第一词特征表示模型和第二词特征表示模型,所述第一词特征表示模型和所述第二词特征表示模型是分别基于不同的词特征表示训练机制以及/或者不同的训练语料得到的。
  16. 根据权利要求13所述的方法,其中,还包括:基于典型相关分析确定所述第一文本向量与所述第二文本向量之间的相关性,并且以使得所述相关性满足预定条件为目标来调整所述典型相关分析的参数。
  17. 根据权利要求13至16中任一项所述的方法，还包括：针对多个文本分别确定相应的第一文本向量与第二文本向量之间的相关性并获得相应的第三文本向量；以及基于所述多个文本的第三文本向量建立多视角文本特征表示模型。
  18. 一种用于文本处理的电子设备,包括
    存储器,被配置为存储多视角文本特征表示模型,其中,所述多视角文本特征表示模型是利用根据权利要求17所述的方法建立的;以及
    处理器,被配置为从所述存储器读取所述多视角文本特征表示模型,并且基于所述多视角文本特征表示模型将待处理的文本对象映射为相应的多维实数向量。
  19. 根据权利要求18所述的电子设备,其中,所述文本对象对应于词语。
  20. 根据权利要求19所述的电子设备,其中,所述处理器进一步被配置为基于所述文本对象的多维实数向量对包含所述文本对象的短语、句子和段落中至少之一进行文本理解。
  21. 一种用于文本处理的方法,包括:
    从存储器读取多视角文本特征表示模型,其中,所述多视角文本特征表示模型是利用根据权利要求17所述的方法建立的;以及
    基于所述多视角文本特征表示模型将待处理的文本对象映射为相应的多维实数向量。
  22. 根据权利要求21所述的方法,其中,所述文本对象对应于词语。
  23. 根据权利要求22所述的方法,还包括:基于所述文本对象的多维实数向量对包含所述文本对象的短语、句子和段落中至少之一进行文本理解。
PCT/CN2017/077473 2016-03-22 2017-03-21 用于文本处理的电子设备和方法 WO2017162134A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/080,670 US10860798B2 (en) 2016-03-22 2017-03-21 Electronic device and method for text processing
EP17769411.4A EP3435247A4 (en) 2016-03-22 2017-03-21 ELECTRONIC DEVICE AND TEXT PROCESSING METHOD
CN201780007352.1A CN108475262A (zh) 2016-03-22 2017-03-21 用于文本处理的电子设备和方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610166105.3A CN107220220A (zh) 2016-03-22 2016-03-22 用于文本处理的电子设备和方法
CN201610166105.3 2016-03-22

Publications (1)

Publication Number Publication Date
WO2017162134A1 true WO2017162134A1 (zh) 2017-09-28

Family

ID=59899235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077473 WO2017162134A1 (zh) 2016-03-22 2017-03-21 用于文本处理的电子设备和方法

Country Status (4)

Country Link
US (1) US10860798B2 (zh)
EP (1) EP3435247A4 (zh)
CN (2) CN107220220A (zh)
WO (1) WO2017162134A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020103721A1 (zh) * 2018-11-19 2020-05-28 腾讯科技(深圳)有限公司 信息处理的方法、装置及存储介质
CN111566665A (zh) * 2020-03-16 2020-08-21 香港应用科技研究院有限公司 在自然语言处理中应用图像编码识别的装置和方法
CN112115721A (zh) * 2020-09-28 2020-12-22 青岛海信网络科技股份有限公司 一种命名实体识别方法及装置
CN112509562A (zh) * 2020-11-09 2021-03-16 北京有竹居网络技术有限公司 用于文本后处理的方法、装置、电子设备和介质
WO2021184385A1 (en) * 2020-03-16 2021-09-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for applying image encoding recognition in natural language processing

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305306B (zh) * 2018-01-11 2020-08-21 中国科学院软件研究所 一种基于草图交互的动画数据组织方法
CN108170684B (zh) * 2018-01-22 2020-06-05 京东方科技集团股份有限公司 文本相似度计算方法及系统、数据查询系统和计算机产品
US11023580B1 (en) * 2018-03-21 2021-06-01 NortonLifeLock Inc. Systems and methods for cross-product malware categorization
CN109299887B (zh) * 2018-11-05 2022-04-19 创新先进技术有限公司 一种数据处理方法、装置及电子设备
CN109670171B (zh) * 2018-11-23 2021-05-14 山西大学 一种基于词对非对称共现的词向量表示学习方法
CN109800298B (zh) * 2019-01-29 2023-06-16 苏州大学 一种基于神经网络的中文分词模型的训练方法
CN110321551B (zh) * 2019-05-30 2022-12-06 泰康保险集团股份有限公司 GloVe词向量模型增量训练方法、装置、介质及电子设备
US20210056127A1 (en) * 2019-08-21 2021-02-25 Nec Laboratories America, Inc. Method for multi-modal retrieval and clustering using deep cca and active pairwise queries
CN111047917B (zh) * 2019-12-18 2021-01-15 四川大学 一种基于改进dqn算法的航班着陆调度方法
CN111026319B (zh) * 2019-12-26 2021-12-10 腾讯科技(深圳)有限公司 一种智能文本处理方法、装置、电子设备及存储介质
CN111476026A (zh) * 2020-03-24 2020-07-31 珠海格力电器股份有限公司 语句向量的确定方法、装置、电子设备及存储介质
CN113642302B (zh) * 2020-04-27 2024-04-02 阿里巴巴集团控股有限公司 文本填充模型的训练方法及装置、文本处理方法及装置
CN111797589A (zh) * 2020-05-29 2020-10-20 华为技术有限公司 一种文本处理网络、神经网络训练的方法以及相关设备
CN112115718A (zh) * 2020-09-29 2020-12-22 腾讯科技(深圳)有限公司 内容文本生成方法和装置、音乐评论文本生成方法
EP4254401A4 (en) * 2021-02-17 2024-05-01 Samsung Electronics Co Ltd ELECTRONIC DEVICE AND CONTROL METHOD THEREFOR
CN116911288B (zh) * 2023-09-11 2023-12-12 戎行技术有限公司 一种基于自然语言处理技术的离散文本识别方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040148154A1 (en) * 2003-01-23 2004-07-29 Alejandro Acero System for using statistical classifiers for spoken language understanding
CN104199809A (zh) * 2014-04-24 2014-12-10 江苏大学 一种专利文本向量的语义表示方法
CN104657350A (zh) * 2015-03-04 2015-05-27 中国科学院自动化研究所 融合隐式语义特征的短文本哈希学习方法
CN104881401A (zh) * 2015-05-27 2015-09-02 大连理工大学 一种专利文献聚类方法
CN104915448A (zh) * 2015-06-30 2015-09-16 中国科学院自动化研究所 一种基于层次卷积网络的实体与段落链接方法

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714897B2 (en) * 2001-01-02 2004-03-30 Battelle Memorial Institute Method for generating analyses of categorical data
US8060360B2 (en) * 2007-10-30 2011-11-15 Microsoft Corporation Word-dependent transition models in HMM based word alignment for statistical machine translation
US20140067368A1 (en) * 2012-08-29 2014-03-06 Microsoft Corporation Determining synonym-antonym polarity in term vectors
US20140278349A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Language Model Dictionaries for Text Predictions
US9575952B2 (en) * 2014-10-21 2017-02-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
US9607616B2 (en) * 2015-08-17 2017-03-28 Mitsubishi Electric Research Laboratories, Inc. Method for using a multi-scale recurrent neural network with pretraining for spoken language understanding tasks
CN106484682B (zh) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 基于统计的机器翻译方法、装置及电子设备
CN106486115A (zh) * 2015-08-28 2017-03-08 株式会社东芝 改进神经网络语言模型的方法和装置及语音识别方法和装置
KR101778679B1 (ko) * 2015-10-02 2017-09-14 네이버 주식회사 딥러닝을 이용하여 텍스트 단어 및 기호 시퀀스를 값으로 하는 복수 개의 인자들로 표현된 데이터를 자동으로 분류하는 방법 및 시스템
US10019438B2 (en) * 2016-03-18 2018-07-10 International Business Machines Corporation External word embedding neural network language models
KR20180001889A (ko) * 2016-06-28 2018-01-05 삼성전자주식회사 언어 처리 방법 및 장치
US10515400B2 (en) * 2016-09-08 2019-12-24 Adobe Inc. Learning vector-space representations of items for recommendations using word embedding models

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040148154A1 (en) * 2003-01-23 2004-07-29 Alejandro Acero System for using statistical classifiers for spoken language understanding
CN104199809A (zh) * 2014-04-24 2014-12-10 江苏大学 一种专利文本向量的语义表示方法
CN104657350A (zh) * 2015-03-04 2015-05-27 中国科学院自动化研究所 融合隐式语义特征的短文本哈希学习方法
CN104881401A (zh) * 2015-05-27 2015-09-02 大连理工大学 一种专利文献聚类方法
CN104915448A (zh) * 2015-06-30 2015-09-16 中国科学院自动化研究所 一种基于层次卷积网络的实体与段落链接方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3435247A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020103721A1 (zh) * 2018-11-19 2020-05-28 腾讯科技(深圳)有限公司 信息处理的方法、装置及存储介质
US11977851B2 (en) 2018-11-19 2024-05-07 Tencent Technology (Shenzhen) Company Limited Information processing method and apparatus, and storage medium
CN111566665A (zh) * 2020-03-16 2020-08-21 香港应用科技研究院有限公司 在自然语言处理中应用图像编码识别的装置和方法
CN111566665B (zh) * 2020-03-16 2021-07-30 香港应用科技研究院有限公司 在自然语言处理中应用图像编码识别的装置和方法
WO2021184385A1 (en) * 2020-03-16 2021-09-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for applying image encoding recognition in natural language processing
US11132514B1 (en) 2020-03-16 2021-09-28 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for applying image encoding recognition in natural language processing
CN112115721A (zh) * 2020-09-28 2020-12-22 青岛海信网络科技股份有限公司 一种命名实体识别方法及装置
CN112115721B (zh) * 2020-09-28 2024-05-17 青岛海信网络科技股份有限公司 一种命名实体识别方法及装置
CN112509562A (zh) * 2020-11-09 2021-03-16 北京有竹居网络技术有限公司 用于文本后处理的方法、装置、电子设备和介质
CN112509562B (zh) * 2020-11-09 2024-03-22 北京有竹居网络技术有限公司 用于文本后处理的方法、装置、电子设备和介质

Also Published As

Publication number Publication date
EP3435247A4 (en) 2019-02-27
EP3435247A1 (en) 2019-01-30
CN107220220A (zh) 2017-09-29
US10860798B2 (en) 2020-12-08
CN108475262A (zh) 2018-08-31
US20190018838A1 (en) 2019-01-17

Similar Documents

Publication Publication Date Title
WO2017162134A1 (zh) 用于文本处理的电子设备和方法
CN111832289B (zh) 一种基于聚类和高斯lda的服务发现方法
TW201837746A (zh) 特徵向量的產生、搜索方法、裝置及電子設備
He et al. Cross-modal subspace learning via pairwise constraints
CN112069826B (zh) 融合主题模型和卷积神经网络的垂直域实体消歧方法
US10915707B2 (en) Word replaceability through word vectors
JP2004110161A (ja) テキスト文比較装置
CN107977368B (zh) 信息提取方法及系统
Banik et al. Gru based named entity recognition system for bangla online newspapers
Al Omari et al. Hybrid CNNs-LSTM deep analyzer for arabic opinion mining
CN112434533A (zh) 实体消歧方法、装置、电子设备及计算机可读存储介质
CN115759119A (zh) 一种金融文本情感分析方法、系统、介质和设备
JP2020098592A (ja) ウェブページ内容を抽出する方法、装置及び記憶媒体
CN116932730B (zh) 基于多叉树和大规模语言模型的文档问答方法及相关设备
JP6899973B2 (ja) 意味関係学習装置、意味関係学習方法、及び意味関係学習プログラム
Gonzales Sociolinguistic analysis with missing metadata? Leveraging linguistic and semiotic resources through deep learning to investigate English variation and change on Twitter
Aktas et al. Text classification via network topology: A case study on the holy quran
JP6586055B2 (ja) 深層格解析装置、深層格学習装置、深層格推定装置、方法、及びプログラム
JP2017538226A (ja) スケーラブルなウェブデータの抽出
CN113886530A (zh) 一种语义短语抽取方法及相关装置
CN116415587A (zh) 信息处理装置和信息处理方法
JP5342574B2 (ja) トピックモデリング装置、トピックモデリング方法、及びプログラム
CN112836014A (zh) 一种面向多领域跨学科的专家遴选方法
Shoukry et al. Machine learning and semantic orientation ensemble methods for Egyptian telecom tweets sentiment analysis
Alian et al. Unsupervised learning blocking keys technique for indexing Arabic entity resolution

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2017769411

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017769411

Country of ref document: EP

Effective date: 20181022

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17769411

Country of ref document: EP

Kind code of ref document: A1