WO2017092623A1 - Text vector representation method and apparatus - Google Patents

Text vector representation method and apparatus

Info

Publication number
WO2017092623A1
WO2017092623A1 (PCT/CN2016/107312)
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
corpus
target
topic
Prior art date
Application number
PCT/CN2016/107312
Other languages
English (en)
French (fr)
Inventor
祁国晟
何鑫
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司
Publication of WO2017092623A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • the present application relates to the field of natural language processing, and in particular to a text vector representation method and apparatus.
  • Text vector representation is the process of representing unstructured text into mathematical vectors through a series of calculations. It is the basis and premise of many tasks in the field of natural language processing. In tasks such as text categorization, text clustering, and similarity calculation, it is necessary to perform vectorization transformation on the text in advance, and then replace the original text with vectorized text for mathematical operations and statistics. It can be seen that the quality of the text vector will directly affect the analysis results.
  • the general method of text vector representation is to use the Vector Space Model (VSM) to represent text as vectors under several feature dimensions. The ability of the vector to represent the text is related to the way the feature is selected and the way the weight is calculated in each feature dimension.
  • in the related art, feature selection merely picks, from the set of words obtained by segmenting the text, several relatively expressive segmented words as candidate features.
  • feature weights are likewise computed from statistics of those segmented words within the text.
  • this text vector representation method treats the text, fragmented, as a mere collection of words, and the generated vector cannot truly express the semantic information contained in the text.
  • the main purpose of the present application is to provide a text vector representation method and apparatus to solve the problem that the text vector representation method in the related art has weak ability to express semantic information contained in the text.
  • a text vector representation method comprises: obtaining a test text; characterizing the test text to obtain a target text represented by a plurality of text features; processing the target text with the pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with the pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic from the target topic feature set; and processing the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
  • the method further includes: acquiring a training corpus, wherein the training corpus is a corpus used for training; characterizing the training corpus to obtain multiple corpus features; separately training a feature embedding vector for each corpus feature to obtain the feature embedding vector set; acquiring multiple topics in the training corpus; separately training the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and storing the feature embedding vector set and the feature topic relationship matrix.
  • the characterization processing includes word segmentation processing; characterizing the training corpus to obtain a plurality of corpus features includes: segmenting the training corpus to obtain a plurality of corpus segmentation results, and characterizing the test text to obtain the target text represented by a plurality of text features includes: segmenting the test text to obtain multiple text segmentation results.
  • the method further comprises: performing id-ization on each corpus segmentation result to obtain a first data set, wherein id-ization means mapping each corpus segmentation result to an id; and representing the plurality of corpus features by the first data set. After the test text is segmented into a plurality of text segmentation results, the method further includes: performing id-ization on each text segmentation result to obtain a second data set; and representing the target text by the second data set.
  • separately training the feature embedding vector of each corpus feature means training the feature embedding vector of each corpus feature with the Word2vec algorithm.
  • separately training the relationship between each topic and each corpus feature means training the relationship between each topic and each corpus feature with the LDA algorithm.
  • processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text includes: transforming the target text through the pre-stored feature topic relationship matrix according to a preset transformation manner to obtain the topic distribution of the target text, wherein the algorithm used in the preset transformation manner is the LDA algorithm.
  • calculating the topic distribution and the vectors representing the target topics to obtain the vector representing the test text includes: multiplying the proportion corresponding to each target topic by that topic's vector; and summing the weighted products to obtain the vector representing the test text.
  • a text vector representation apparatus includes: a first acquiring unit, configured to acquire the test text; a first processing unit, configured to perform characterization processing on the test text to obtain the target text represented by the plurality of text features; a second processing unit, configured to process the target text with the pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; an extension unit, configured to extend the text features describing a target topic with the pre-stored feature embedding vector set to obtain a target topic feature set, and to obtain a vector representing the target topic from the target topic feature set; and a calculating unit, configured to perform calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
  • the apparatus further includes: a second acquiring unit, configured to acquire a training corpus, wherein the training corpus is a corpus used for training; a third processing unit, configured to perform characterization processing on the training corpus to obtain multiple corpus features; a first training unit, configured to separately train a feature embedding vector for each corpus feature to obtain a feature embedding vector set; a third acquiring unit, configured to acquire a plurality of topics in the training corpus; a second training unit, configured to separately train the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and a storage unit, configured to store the feature embedding vector set and the feature topic relationship matrix.
  • the characterization processing includes word segmentation processing
  • the third processing unit is further configured to perform word segmentation processing on the training corpus to obtain a plurality of corpus segmentation results
  • the first processing unit is further configured to perform word segmentation processing on the test text to obtain multiple text segmentation results.
  • the apparatus further includes: a fourth processing unit, configured to, after word segmentation processing is performed on the training corpus to obtain a plurality of corpus segmentation results, perform id-ization on each corpus segmentation result to obtain a first data set, wherein id-ization means mapping each corpus segmentation result to an id, and to represent the plurality of corpus features by the first data set; and a fifth processing unit, configured to, after word segmentation processing is performed on the test text to obtain a plurality of text segmentation results, perform id-ization on each text segmentation result to obtain a second data set, and to represent the target text by the second data set.
  • separately training the feature embedding vector of each corpus feature means training the feature embedding vector of each corpus feature with the Word2vec algorithm.
  • separately training the relationship between each topic and each corpus feature means training the relationship between each topic and each corpus feature with the LDA algorithm.
  • the second processing unit is further configured to transform the target text by using a pre-stored feature topic relationship matrix according to a preset transformation manner to obtain a theme distribution of the target text, wherein the algorithm used in the preset transformation mode is an LDA algorithm.
  • through the present application, the following steps are adopted: obtaining a test text; characterizing the test text to obtain a target text represented by a plurality of text features; processing the target text with the pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with the pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic from the target topic feature set; and calculating the topic distribution and the vectors representing the target topics to obtain a vector representing the test text. This solves the problem that the text vector representation method in the related art is weak at expressing the semantic information contained in the text.
  • by introducing topics and text features (semantics), the vectorized text becomes more expressive, remedying the related-art shortcoming of insufficient ability to interpret the semantics implicit behind the literal text, greatly improving the degree to which a computer understands text semantics, and thereby improving the ability to express the semantic information contained in the text.
  • FIG. 1 is a flowchart of a text vector representation method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a text vector representation apparatus in accordance with an embodiment of the present application.
  • a text vector representation method is provided.
  • FIG. 1 is a flow chart of a text vector representation method in accordance with an embodiment of the present application. As shown in Figure 1, the method includes the following steps:
  • Step S101: acquire a test text.
  • before the test text is acquired, the method further includes: acquiring a training corpus, wherein the training corpus is a corpus used for training; characterizing the training corpus to obtain multiple corpus features; separately training the feature embedding vector of each corpus feature to obtain the feature embedding vector set; acquiring multiple topics in the training corpus; separately training the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and storing the feature embedding vector set and the feature topic relationship matrix.
  • the characterization processing includes word segmentation processing; characterizing the training corpus to obtain the plurality of corpus features includes: performing segmentation processing on the training corpus to obtain multiple corpus segmentation results.
  • each corpus segmentation result is then id-ized to obtain the first data set, wherein id-ization means mapping each corpus segmentation result to an id; and the plurality of corpus features are represented by the first data set.
  • the feature embedding vector of each corpus feature is separately trained with the Word2vec algorithm, and the relationship between each topic and each corpus feature is separately trained with the LDA algorithm.
  • the training corpus is composed of a large amount of text to be processed, and the training corpus may be from materials and files, or may be from the Internet.
  • the above characterization process is a process of representing a training corpus as a corpus feature set, where the corpus feature is used to represent a certain feature of the training corpus.
  • the types of corpus features can be varied: every word in the training corpus can serve as a corpus feature, and so can any phrase consisting of two adjacent words, or a question, a transition relationship, etc. contained in the training corpus.
  • the word segmentation processing of the training corpus is employed as the characterization process in the present application.
  • each corpus segmentation result can be a corpus segmentation word, or a phrase composed of any two adjacent corpus segmentation words, or a question, a transition relationship, and the like included in the training corpus. Therefore, the number of all corpus segmentation results contained in the entire training corpus is equivalent to the number of all corpus features, and the number is recorded as D.
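  • purely as an illustration of this characterization step (the patent does not prescribe a particular segmenter; the jieba tokenizer and the bigram feature below are assumptions of this sketch):

```python
# Minimal sketch of segmentation-based characterization.
# Assumes the open-source jieba Chinese segmenter; bigrams stand in for the
# "phrase composed of two adjacent words" kind of corpus feature.
import jieba

def characterize(text: str) -> list[str]:
    words = jieba.lcut(text)                             # segmentation words
    bigrams = [a + b for a, b in zip(words, words[1:])]  # adjacent-word phrases
    return words + bigrams                               # corpus feature sequence
```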
  • the corpus features are subjected to id processing such that each corpus feature corresponds to a unique id, and id is used as a representation of the feature for subsequent processing.
  • the id is a hash code generated for each corpus feature, or an auto-increment key with an initial value of 0. Either way, the field representing the id is an int or a long, a large compression in space compared with the complex feature itself; the space performance is thus optimized.
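  • a minimal sketch of this id-ization, using the 0-initialized auto-increment variant and continuing the segmentation sketch above (names are illustrative):

```python
# Map each distinct corpus feature to a unique int id (auto-increment from 0),
# so later stages work on compact ids instead of raw feature strings.
feature_to_id: dict[str, int] = {}

def to_id(feature: str) -> int:
    if feature not in feature_to_id:
        feature_to_id[feature] = len(feature_to_id)  # next free id
    return feature_to_id[feature]

# The "first data set": the id-ized corpus features of one training text.
first_data_set = [to_id(f) for f in characterize("示例训练文本")]
```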
  • next, the vector dimension n is set, the n-dimensional feature embedding vector of each corpus feature in the training corpus is trained, and the feature embedding vector set is output.
  • the so-called feature embedding vector is a mathematical vector representation of the feature. It can be approximated that the feature embedding vector has the information amount of the feature itself. Therefore, a representation of the mathematical vectorization of all features is obtained through the training process.
  • the parameter n that needs to be set represents the dimension into which features are expected to be converted, so the parameter n set here is also the dimension of the final text vector.
  • the value of n should not be too large. Considering computation and storage, each feature has an array of length n, and there are D features, requiring n*D values in total, so the value of n is tied to space complexity. Moreover, beyond a suitable size the embedding vector already suffices to represent the feature, and increasing n helps less and less in representing more information. The recommended reference value of n is therefore 200.
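  • for a concrete sense of scale (the figures are illustrative, not from the patent): with D = 1,000,000 corpus features and n = 200, the embedding set holds n·D = 2×10⁸ values, roughly 800 MB as 32-bit floats, which is why n is kept modest.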
  • training the feature embedding vector of each corpus feature can use the Word2vec algorithm proposed by Google, which trains the features with a 3-layer neural network and yields the feature embedding vector set required in the present application.
  • the specific algorithm for training the feature embedding vector of each corpus feature is not limited.
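  • a sketch of this training step under the assumption that gensim's Word2Vec implementation is used (the patent only requires the Word2vec algorithm or an equivalent; parameter names follow gensim 4.x, and id_corpus is an assumed variable holding the id-ized documents):

```python
# Train n=200-dimensional feature embedding vectors over the id-ized corpus.
# id_corpus: list of id-ized documents, e.g. produced by the sketches above.
from gensim.models import Word2Vec

sentences = [[str(i) for i in doc] for doc in id_corpus]  # ids as string tokens
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)
embedding_set = {int(w): model.wv[w] for w in model.wv.index_to_key}  # id -> vector
```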
  • finally, the number of topics m is set, the m*D relationship matrix between the latent topics of the training corpus and each corpus feature is trained, and the feature topic relationship matrix is output. Setting the number of topics m is similar to setting the vector dimension n described above: m is a parameter related to space complexity, indicating the number of latent topics in the semantic field.
  • a topic is a hypothetical concept that cannot be explained directly by description; it is instead explained approximately by a set of features that can describe the topic.
  • topic modeling usually faces the problem that the number of topics is difficult to determine.
  • suppose the actual semantic field contains m* topics; since it is difficult to automatically optimize m* during training, m* must be approximated by a chosen value m.
  • approximating the true number of topics amounts to capturing as much of the information in the semantic field as possible. If m<m*, some topic information is not represented. If m>m*, all m* topics have their corresponding representations, while an additional m-m* topics exist that are either mis-separations or overlap with the m* topics. For the present application, selecting a relatively large m as the number of topics is therefore a relatively safe strategy that does not cause information loss; balancing space complexity, a reference value of 50 to 100 for m is given. In the present application, the LDA algorithm is used to train the relationship between the topics and each corpus feature; alternative methods such as PLSA or SVD can also implement this training process.
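  • a sketch of training the m×D feature topic relationship matrix, assuming scikit-learn's LDA implementation stands in for the LDA training described (m = 80 is one value in the recommended 50–100 range):

```python
# Train the feature topic relationship matrix with LDA.
# lda.components_ has shape (m, D): one row per topic, one column per corpus
# feature, each value an (unnormalized) membership degree.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [" ".join(str(i) for i in doc) for doc in id_corpus]  # ids as tokens
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=80, random_state=0)
lda.fit(counts)
feature_topic_matrix = lda.components_  # the m*D matrix to persist
```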
  • Step S102: perform characterization processing on the test text to obtain the target text represented by the plurality of text features.
  • the test text obtained above is characterized. Chinese word segmentation is used as the means of characterization, that is, a text is input and the program performs Chinese word segmentation on it to obtain a sequence of text features, wherein each text feature can be a text feature word, a phrase composed of any two adjacent text feature words, or a question, a transition relationship, etc. contained in the test text.
  • the textual features here correspond to the corpus features described above.
  • the test text is characterized, and the target text represented by the plurality of text features is obtained by: performing word segmentation on the test text to obtain multiple text segmentation results.
  • each text segmentation result is then id-ized to obtain the second data set, and the target text is represented by the second data set.
  • the text segmentation result may be a text segmentation word, or a phrase composed of any two adjacent text segmentation words, or a question, a transition relationship, and the like included in the test text.
  • the purpose of the idification process for each text segmentation result is the same as that of the corpus segmentation result, and will not be described here.
  • the target text is processed with the pre-stored feature topic relationship matrix, and the topic distribution of the target text is obtained by: transforming the target text through the pre-stored feature topic relationship matrix according to a preset transformation manner to obtain the topic distribution of the target text, wherein the algorithm used in the preset transformation manner is the LDA algorithm.
  • the algorithm used in the preset transformation mode is consistent with the algorithm used to train the relationship between each topic and each corpus feature, and the specific algorithm is not limited in the present application.
  • after segmentation, the resulting words need to be matched against the corpus features and their ids in the feature topic relationship matrix; the ids matching the same corpus features are taken as return values, and a feature id sequence is output. Segmented words that never appeared in the training corpus have no corresponding feature and can simply be discarded.
  • the characterization processing steps for the training corpus need to be consistent with the characterization processing steps for the test text, that is, the same characterization logic is required to ensure that the processed feature sets are comparable.
  • Step S103: process the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic.
  • the feature topic relationship matrix needs to be persistently stored in system memory and is read from the storage medium into memory at initialization. Alternatively, the corpus data for training the feature topic relationship matrix can be input to the system, which pre-trains the matrix online, stores the model directly in memory, and then performs online processing directly.
  • the feature topic relationship matrix obtained by the training process can be represented as a matrix.
  • the dimension of the matrix is m*D, that is, each row represents a topic, and each column represents a corpus feature.
  • the values in the matrix indicate the extent to which the corpus features belong to the topic. This degree is referred to in this application as the degree of membership.
  • the meaning and range of the membership degree differ across the algorithms used to train the feature topic relationship matrix, but in general the relative comparison of membership degrees is meaningful: the greater the membership degree, the more the corpus feature belongs to the topic, and the better the corpus feature can be used to express the topic.
  • the description of the abstract concept of the subject can be approximated by extracting several corpus features that best describe the subject.
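  • since the abstract topic is approximated by its most descriptive features, a small illustrative helper (not from the patent) for extracting the top-k features per topic from the matrix:

```python
import numpy as np

def top_features(feature_topic_matrix: np.ndarray, topic: int, k: int = 10) -> list[int]:
    """Return ids of the k features with the highest membership degree for a topic."""
    row = feature_topic_matrix[topic]          # membership degrees over all D features
    return np.argsort(row)[::-1][:k].tolist()  # indices of the k largest values
```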
  • the input target text represented by the text feature may be transformed by the feature topic relationship matrix into a target text represented by the topic distribution.
  • the specific transformation method is related to the structure of the feature topic relation matrix and the algorithm of the training feature topic relationship matrix.
  • the training process uses the LDA algorithm. Therefore, the LDA algorithm is also used to perform the above transformation, and the details are as follows:
  • in the initial pass, a random number from 0 to m-1 is generated for each text feature of the target text represented by text features, as that feature's vote for the topic information it contributes to the original text. It should be noted that this initial voting is entirely random.
  • the online process will iterate over the topic votes on each of the text features in the text. Every time you traverse all the text features in the text, it is counted as one iteration.
  • the number of iterations can be input as parameters beforehand. Usually, the reference value is 10 or more, and you can increase or decrease according to the number of topics.
  • for each topic vote to be updated, which topic receives the next vote is mainly affected by two factors: one is the distribution of topic votes over all other text features in the text at that moment, and the other is the feature's distribution across the different topics in the feature topic relationship matrix. For each possible next vote value, the probability of voting for that topic is given by Equation 1.
  • [Equation 1 is published only as an image (PCTCN2016107312-appb-000001); in the standard collapsed Gibbs sampling form matching the description, it reads approximately P(topic x | feature k) ∝ (n_x + α) · M(x, k), where n_x is the current number of votes for topic x among the text's other features, α is a smoothing prior, and M(x, k) is the membership degree of feature k under topic x.]
  • from Equation 1, the probability of the next vote going to each topic can be calculated.
  • the sampling process of the algorithm casts the vote to a topic with exactly that probability, as follows: suppose there is an array of length m whose values are P(topic x | feature k). First, traverse the array and accumulate a running sum to generate a cumulative probability array; for example, the original array [0.2, 0.5, 0.3] accumulates to [0.2, 0.7, 1]. Next, generate a random number and traverse a second time, checking whether the random number is smaller than the current value; if so, return the current traversal position. For example, with a generated random number of 0.88:
  • traverse the 0th value 0.2, judge 0.88>0.2, continue to the next position;
  • traverse the 1st value 0.7, judge 0.88>0.7, continue to the next position;
  • traverse the 2nd value 1, judge 0.88<1, return traversal position 2.
  • after the iterative online process terminates, aggregating the vote results over all text features yields the topic distribution of the text.
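  • a sketch of the sampling routine exactly as described (cumulative-sum array plus one uniform draw):

```python
import random

def sample_topic(probs: list[float]) -> int:
    """Draw a topic index with probability proportional to probs.

    Mirrors the described procedure: build the cumulative array
    (e.g. [0.2, 0.5, 0.3] -> [0.2, 0.7, 1.0]), draw a uniform random number,
    and return the first position whose cumulative value exceeds it.
    """
    cumulative, total = [], 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    r = random.random() * total          # total is 1.0 for normalized probs
    for i, c in enumerate(cumulative):
        if r < c:
            return i
    return len(probs) - 1                # guard against floating-point rounding
```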
  • Step S104: extend the text features describing the target topic with the pre-stored feature embedding vector set to obtain the target topic feature set, and obtain a vector representing the target topic according to the target topic feature set.
  • the feature embedding vector set likewise needs to be persistently stored in system memory and read in at initialization; alternatively, the corpus data for training the feature embedding vector set may be input to the system, which pre-trains the set online, stores it directly in memory, and then performs online processing directly.
  • the topic distribution of the text is obtained in step S103, but the current topic distribution is still generated by the features owned by the text itself.
  • the representation of each topic is also based on each individual feature, so the concepts of the features used bear no relation to each other, that is, they are independent. For example, the interim result for the target text may be that 20% of it describes music, 30% describes sports, and 50% describes computers.
  • the features with high membership degree under each topic are {music: ["folk", "jazz", "classical"], sports: ["soccer", "basketball", "volleyball"], computer: ["hard disk", "memory", "graphics card"]}, and so on. In reality, however, features should not be mutually independent: the relation between "Apple" and "Jobs" is certainly closer than that between "Apple" and "sports", and describing sports only through "soccer", "basketball", and "volleyball" would lose much other closely related feature information.
  • the specific implementation is based on the feature embedded vector model obtained during the training process.
  • the feature embedding vector of each feature from the training process is available; its dimension is n, and the vector can be regarded approximately as representing all the information of the feature.
  • it can be interpreted as a point in n-dimensional space, or as the vector from the origin to that point. Therefore, the distance between different feature embedding vectors can be computed, and because the embedding vectors represent features, the distance between embedding vectors is exactly the distance between the corresponding features.
  • the feature set describing a topic is taken as the topic's source, and the source features are extended on a semantic basis.
  • the extension logic is to compute other features close in distance to them, and let those features describe the topic together. For example, in the example above, the distance from "badminton" to "soccer", "basketball", and "volleyball" is computed to be close enough, so "badminton" also becomes a feature describing the topic "sports", whereas "monitor" cannot serve as a feature expressing "sports".
  • a rich feature set describing the topic can be obtained, and then a semantic level vector representation of the specific topic is required.
  • a vector describing a topic also needs to rely on a feature set that describes the topic.
  • Geometrically a number of points (features) in space are clustered together to form a cluster of points in space that are collectively describing a topic. While a topic corresponds to a certain range of regions in space, it is necessary to represent this region with a vector approximation.
  • One of the best ways is the center of gravity of the region, that is, the geometric midpoint.
  • the subject area is unknown, the subject area is approximated by the feature point cluster, that is, the arithmetic mean of all the feature vectors is calculated. So far, the vector representation of the subject at the semantic level is obtained, and the dimension of the vector is equal to the dimension n of the feature embedding vector.
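  • a sketch of this expansion-plus-centroid computation, assuming the embedding_set from the Word2vec sketch above and cosine similarity as the closeness measure (the patent speaks only of vector distance; the threshold is illustrative):

```python
import numpy as np

def topic_vector(seed_ids: list[int], embedding_set: dict, sim_threshold: float = 0.6) -> np.ndarray:
    """Semantically expand a topic's descriptive features, then return the
    centroid (arithmetic mean) of the expanded feature embedding vectors."""
    seeds = [np.asarray(embedding_set[i]) for i in seed_ids]
    expanded = list(seeds)
    for fid, vec in embedding_set.items():
        if fid in seed_ids:
            continue
        vec = np.asarray(vec)
        # include features close enough (here: cosine similarity) to any seed
        if any(np.dot(vec, s) / (np.linalg.norm(vec) * np.linalg.norm(s)) > sim_threshold
               for s in seeds):
            expanded.append(vec)
    return np.mean(expanded, axis=0)  # geometric midpoint of the point cluster
```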
  • Step S105: perform calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text.
  • the topic distribution and the vectors representing the target topics are calculated, and the vector representing the test text is obtained, by: multiplying the proportion corresponding to each target topic by that topic's vector; and summing the weighted products to obtain the vector representing the test text.
  • specifically, this step aggregates the results of step S103 and step S104 into the vector representation of the text; the aggregation is the weighted sum v(text) = Σ p_i · v(topic_i) over the topic distribution (Equation 2, published as image PCTCN2016107312-appb-000002).
  • step S103 the topic distribution of the text is obtained, that is, the distribution of the space used by the text under each theme.
  • step S104 a vector representation of each topic on a semantic level is obtained.
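  • a sketch of this final aggregation, reproducing the worked example from the description (unit topic vectors with distribution {0.2, 0.3, 0.5} give [0.2, 0.3, 0.5]):

```python
import numpy as np

def text_vector(topic_distribution: dict, topic_vectors: dict) -> np.ndarray:
    """Weighted sum of topic vectors: v(text) = sum_i p_i * v(topic_i)."""
    return sum(p * topic_vectors[t] for t, p in topic_distribution.items())

# Worked example from the description:
v = text_vector({"music": 0.2, "sports": 0.3, "computer": 0.5},
                {"music": np.array([1., 0., 0.]),
                 "sports": np.array([0., 1., 0.]),
                 "computer": np.array([0., 0., 1.])})
# v == array([0.2, 0.3, 0.5])
```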
  • the text vector representation method provided by the embodiment of the present application is a text vector representation method based on text topics and semantics, which largely remedies the related-art shortcoming of insufficient ability to interpret the semantics implicit behind the literal text; it can represent the information behind the text and has better expressive power.
  • during online processing, the text vector representation method provided by the embodiment of the present application supports inputting a single text and returning that text's vector representation, without depending on batch input of a corpus composed of many texts; such streaming input support better matches actual functional needs.
  • the text vector obtained by the text vector representation method provided by the embodiment of the present application has higher evaluation results when applied to natural language processing tasks such as subsequent text classification and similar article calculation, and also greatly improves the degree of computer semantic understanding of the text.
  • the text vector representation method provided by the embodiment of the present application obtains the test text; performs characterization processing on the test text to obtain the target text represented by multiple text features; processes the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extends the text features describing a target topic with the pre-stored feature embedding vector set to obtain the target topic feature set, and obtains a vector representing the target topic according to the target topic feature set; and calculates the topic distribution and the vectors representing the target topics to obtain the vector representing the test text. This solves the problem that the text vector representation method in the related art is weak at expressing the semantic information contained in the text.
  • by introducing topics and text features (semantics), the vectorized text becomes more expressive, remedying the related-art shortcoming of insufficient ability to interpret the semantics implicit behind the literal text, greatly improving the degree to which a computer understands text semantics, and thereby improving the ability to express the semantic information contained in the text.
  • the embodiment of the present application further provides a text vector representation apparatus. It should be noted that the text vector representation apparatus of the embodiment of the present application may be used to execute the text vector representation method provided by the embodiment of the present application.
  • the text vector representation device provided by the embodiment of the present application is introduced below.
  • the apparatus includes a first acquisition unit 10, a first processing unit 20, a second processing unit 30, an expansion unit 40, and a calculation unit 50.
  • the first obtaining unit 10 is configured to acquire test text.
  • the first processing unit 20 is configured to perform characterization processing on the test text to obtain target text represented by the plurality of text features.
  • the second processing unit 30 is configured to process the target text by using the pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic.
  • the expansion unit 40 is configured to expand a text feature describing the target topic by using the pre-stored feature embedding vector set to obtain a target topic feature set, and obtain a vector representing the target theme according to the target topic feature set.
  • the calculating unit 50 is configured to perform a calculation process on the topic distribution and the vector representing the target topic to obtain a vector representing the test text.
  • the first obtaining unit 10, the first processing unit 20, the second processing unit 30, the extension unit 40, and the computing unit 50 may run as part of the apparatus in a computer terminal, and the functions implemented by the above modules may be executed by a processor in the computer terminal; the computer terminal may also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or similar terminal device.
  • in the text vector representation apparatus provided by the embodiment of the present application, the first obtaining unit 10 obtains the test text; the first processing unit 20 performs characterization processing on the test text to obtain the target text represented by the plurality of text features; the second processing unit 30 processes the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; the extension unit 40 extends the text features describing a target topic with the pre-stored feature embedding vector set to obtain the target topic feature set, and obtains a vector representing the target topic according to the target topic feature set; and the calculating unit 50 performs calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text. This solves the problem that the text vector representation method in the related art is weak at expressing the semantic information contained in the text.
  • by introducing topics and text features (semantics), the vectorized text becomes more expressive, remedying the related-art shortcoming of insufficient ability to interpret the semantics implicit behind the literal text, greatly improving the degree to which a computer understands text semantics, and thereby improving the ability to express the semantic information contained in the text.
  • the apparatus further includes: a second obtaining unit, configured to acquire a training corpus, wherein the training corpus is a corpus used for training; a third processing unit, configured to perform characterization processing on the training corpus to obtain multiple corpus features; a first training unit, configured to separately train the feature embedding vector of each corpus feature to obtain a feature embedding vector set; a third obtaining unit, configured to acquire a plurality of topics in the training corpus; a second training unit, configured to separately train the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and a storage unit, configured to store the feature embedding vector set and the feature topic relationship matrix.
  • the characterization processing includes word segmentation processing
  • the third processing unit is further configured to perform word segmentation processing on the training corpus to obtain multiple corpus segmentation results
  • the first processing unit is also used to perform word segmentation on the test text to obtain multiple text segmentation results.
  • the apparatus further includes: a fourth processing unit, configured to, after word segmentation is performed on the training corpus to obtain a plurality of corpus segmentation results, perform id-ization on each corpus segmentation result to obtain a first data set, wherein id-ization means that each corpus segmentation result corresponds to one id, and to represent the multiple corpus features by the first data set.
  • the apparatus further includes: a fifth processing unit, configured to, after word segmentation processing is performed on the test text to obtain a plurality of text segmentation results, perform id-ization on each text segmentation result to obtain a second data set, and to represent the target text by the second data set.
  • separately training the feature embedding vector of each corpus feature means training the feature embedding vector of each corpus feature with the Word2vec algorithm.
  • separately training the relationship between each topic and each corpus feature means training the relationship between each topic and each corpus feature with the LDA algorithm.
  • the second processing unit 30 is further configured to transform the target text by using a pre-stored feature topic relationship matrix according to a preset transformation manner to obtain a topic distribution of the target text.
  • the algorithm used in the preset transformation mode is an LDA algorithm.
  • the various functional units provided by the embodiments of the present application may be operated in a mobile terminal, a computer terminal, or the like, or may be stored as part of a storage medium.
  • embodiments of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals.
  • a computer terminal may also be replaced with a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one network device of the plurality of network devices of the computer network.
  • the computer terminal may execute the program code of the following steps in the text vector representation method: acquiring the test text; characterizing the test text to obtain the target text represented by the plurality of text features; processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with the pre-stored feature embedding vector set to obtain the target topic feature set, and obtaining a vector representing the target topic according to the target topic feature set; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text.
  • the computer terminal can include: one or more processors, memory, and transmission means.
  • the memory can be used to store software programs and modules, such as a text vector representation method and a program instruction/module corresponding to the device in the embodiment of the present invention, and the processor executes various functions by running a software program and a module stored in the memory. Application and data processing, that is, the above-described text vector representation method is implemented.
  • the memory may include a high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • the memory can further include memory remotely located relative to the processor, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the above transmission device is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wired network and a wireless network.
  • the transmission device includes a Network Interface Controller (NIC) that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • the transmission device is a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the memory is used to store preset action conditions and information of the preset rights user, and an application.
  • the processor can call the information and the application stored in the memory via the transmission device to execute the program code of the method steps of each of the alternative or preferred embodiments of the above method embodiments.
  • the computer terminal can also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to save program code executed by the text vector representation method provided by the foregoing method embodiment and the device embodiment.
  • the foregoing storage medium may be located in any one of the computer terminal groups in the computer network, or in any one of the mobile terminal groups.
  • the storage medium is configured to store program code for performing the following steps: acquiring the test text; characterizing the test text to obtain the target text represented by the plurality of text features; processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with the pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic according to the target topic feature set; and calculating the topic distribution and the vectors representing the target topics to obtain the vector representing the test text.
  • the storage medium may also be arranged to store program code of various preferred or optional method steps provided by the text vector representation method.
  • the text vector representation apparatus includes a processor and a memory; the first acquiring unit, the first processing unit, the second processing unit, the extension unit, the calculating unit, and the like are all stored in the memory as program units, and the processor executes the above program units stored in the memory to implement the corresponding functions.
  • the processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels can be provided, and the text vector is represented by adjusting kernel parameters.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory (flash RAM), the memory including at least one Memory chip.
  • the present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initializing the following method steps: obtaining the test text; characterizing the test text to obtain the target text represented by a plurality of text features; processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text, wherein the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with the pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic according to the target topic feature set; and calculating the topic distribution and the vectors representing the target topics to obtain the vector representing the test text.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the modules or steps of the present application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across multiple computing devices. Alternatively, they may be implemented by program code executable by the computing device, so that they may be stored in a storage device and executed by the computing device, or they may be fabricated into individual integrated circuit modules, or multiple of the modules or steps may be implemented as a single integrated circuit module. Thus, the application is not limited to any particular combination of hardware and software.


Abstract

A text vector representation method and apparatus. The method includes: acquiring a test text (S101); performing characterization processing on the test text to obtain a target text represented by a plurality of text features (S102); processing the target text with a pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic (S103); extending the text features describing a target topic with a pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic from the target topic feature set (S104); and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text (S105). This solves the problem that text vector representation methods in the related art are weak at expressing the semantic information contained in text.

Description

Text vector representation method and apparatus
Technical Field
The present application relates to the field of natural language processing, and in particular to a text vector representation method and apparatus.
Background
Text vector representation is the process of representing unstructured text as a mathematical vector through a series of computations; it is the basis and premise of many tasks in natural language processing. Tasks such as text classification, text clustering, and similarity calculation all require the text to be vectorized in advance, after which the vectorized text replaces the original text in mathematical operations and statistics. The quality of the text vector therefore directly affects subsequent analysis results. At present, the general method of text vector representation uses the Vector Space Model (VSM) to represent text as a vector over several feature dimensions; how well the vector represents the text depends on how features are selected and how the weight in each feature dimension is computed. In the related art, feature selection merely picks several relatively expressive words from the text's set of segmented words as candidate features, and feature weights are computed from statistics of those words within the text. This approach treats the text, fragmented, as a mere collection of words, and the resulting vector cannot truly express the semantic information the text contains.
No effective solution has yet been proposed for the problem that text vector representation methods in the related art are weak at expressing the semantic information contained in text.
Summary
The main purpose of the present application is to provide a text vector representation method and apparatus, to solve the problem that text vector representation methods in the related art are weak at expressing the semantic information contained in text.
To achieve the above purpose, according to one aspect of the present application, a text vector representation method is provided. The method includes: acquiring a test text; performing characterization processing on the test text to obtain a target text represented by a plurality of text features; processing the target text with a pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with a pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic from the target topic feature set; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
Further, before acquiring the test text, the method further includes: acquiring a training corpus, where the training corpus is a corpus used for training; performing characterization processing on the training corpus to obtain a plurality of corpus features; separately training a feature embedding vector for each corpus feature to obtain a feature embedding vector set; acquiring a plurality of topics in the training corpus; separately training the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and storing the feature embedding vector set and the feature topic relationship matrix.
Further, the characterization processing includes word segmentation processing; performing characterization processing on the training corpus to obtain a plurality of corpus features includes: performing word segmentation processing on the training corpus to obtain a plurality of corpus segmentation results; and performing characterization processing on the test text to obtain a target text represented by a plurality of text features includes: performing word segmentation processing on the test text to obtain a plurality of text segmentation results.
Further, after performing word segmentation processing on the training corpus to obtain a plurality of corpus segmentation results, the method further includes: performing id-ization on each corpus segmentation result to obtain a first data set, where id-ization means mapping each corpus segmentation result to an id; and representing the plurality of corpus features by the first data set. After performing word segmentation processing on the test text to obtain a plurality of text segmentation results, the method further includes: performing id-ization on each text segmentation result to obtain a second data set; and representing the target text by the second data set.
Further, separately training the feature embedding vector of each corpus feature means training the feature embedding vector of each corpus feature with the Word2vec algorithm.
Further, separately training the relationship between each topic and each corpus feature means training the relationship between each topic and each corpus feature with the LDA algorithm.
Further, processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text includes: transforming the target text through the pre-stored feature topic relationship matrix according to a preset transformation manner to obtain the topic distribution of the target text, where the algorithm used in the preset transformation manner is the LDA algorithm.
Further, performing calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text includes: multiplying the proportion corresponding to each target topic by that topic's vector; and summing the weighted products to obtain the vector representing the test text.
To achieve the above purpose, according to another aspect of the present application, a text vector representation apparatus is provided. The apparatus includes: a first acquiring unit, configured to acquire a test text; a first processing unit, configured to perform characterization processing on the test text to obtain a target text represented by a plurality of text features; a second processing unit, configured to process the target text with a pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; an extension unit, configured to extend the text features describing a target topic with a pre-stored feature embedding vector set to obtain a target topic feature set, and to obtain a vector representing the target topic from the target topic feature set; and a calculating unit, configured to perform calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
Further, the apparatus further includes: a second acquiring unit, configured to acquire a training corpus, where the training corpus is a corpus used for training; a third processing unit, configured to perform characterization processing on the training corpus to obtain a plurality of corpus features; a first training unit, configured to separately train a feature embedding vector for each corpus feature to obtain a feature embedding vector set; a third acquiring unit, configured to acquire a plurality of topics in the training corpus; a second training unit, configured to separately train the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and a storage unit, configured to store the feature embedding vector set and the feature topic relationship matrix.
Further, the characterization processing includes word segmentation processing; the third processing unit is further configured to perform word segmentation processing on the training corpus to obtain a plurality of corpus segmentation results, and the first processing unit is further configured to perform word segmentation processing on the test text to obtain a plurality of text segmentation results.
Further, the apparatus further includes: a fourth processing unit, configured to, after word segmentation processing is performed on the training corpus to obtain a plurality of corpus segmentation results, perform id-ization on each corpus segmentation result to obtain a first data set, where id-ization means mapping each corpus segmentation result to an id, and to represent the plurality of corpus features by the first data set; and a fifth processing unit, configured to, after word segmentation processing is performed on the test text to obtain a plurality of text segmentation results, perform id-ization on each text segmentation result to obtain a second data set, and to represent the target text by the second data set.
Further, separately training the feature embedding vector of each corpus feature means training the feature embedding vector of each corpus feature with the Word2vec algorithm.
Further, separately training the relationship between each topic and each corpus feature means training the relationship between each topic and each corpus feature with the LDA algorithm.
Further, the second processing unit is further configured to transform the target text through the pre-stored feature topic relationship matrix according to a preset transformation manner to obtain the topic distribution of the target text, where the algorithm used in the preset transformation manner is the LDA algorithm.
Through the present application, the following steps are adopted: acquiring a test text; performing characterization processing on the test text to obtain a target text represented by a plurality of text features; processing the target text with a pre-stored feature topic relationship matrix to obtain a topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportion corresponding to each target topic; extending the text features describing a target topic with a pre-stored feature embedding vector set to obtain a target topic feature set, and obtaining a vector representing the target topic from the target topic feature set; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text. This solves the problem that text vector representation methods in the related art are weak at expressing the semantic information contained in text. By introducing topics and text features (semantics), the vectorized text becomes more expressive, remedying the related-art shortcoming of insufficient ability to interpret the semantics implicit behind the literal text, greatly improving the degree to which a computer understands text semantics, and thereby improving the ability to express the semantic information contained in the text.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present application, are provided for further understanding of the application; the illustrative embodiments of the application and their description serve to explain the application and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a text vector representation method according to an embodiment of the present application; and
FIG. 2 is a schematic diagram of a text vector representation apparatus according to an embodiment of the present application.
Detailed Description
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described here can be implemented. Moreover, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
According to an embodiment of the present application, a text vector representation method is provided.
FIG. 1 is a flowchart of a text vector representation method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S101: acquire a test text.
A piece of text input into the system is acquired and used as the test text in the present application.
Preferably, in the text vector representation method provided by the embodiment of the present application, before the test text is acquired, the method further includes: acquiring a training corpus, where the training corpus is a corpus used for training; performing characterization processing on the training corpus to obtain a plurality of corpus features; separately training a feature embedding vector for each corpus feature to obtain a feature embedding vector set; acquiring a plurality of topics in the training corpus; separately training the relationship between each topic and each corpus feature to obtain a feature topic relationship matrix; and storing the feature embedding vector set and the feature topic relationship matrix.
Optionally, in the text vector representation method provided by the embodiment of the present application, the characterization processing includes word segmentation processing, and performing characterization processing on the training corpus to obtain a plurality of corpus features includes: performing word segmentation processing on the training corpus to obtain a plurality of corpus segmentation results.
After word segmentation processing is performed on the training corpus to obtain a plurality of corpus segmentation results, id-ization is performed on each corpus segmentation result to obtain a first data set, where id-ization means mapping each corpus segmentation result to an id; and the plurality of corpus features are represented by the first data set.
Optionally, in the text vector representation method provided by the embodiment of the present application, the feature embedding vector of each corpus feature is trained with the Word2vec algorithm, and the relationship between each topic and each corpus feature is trained with the LDA algorithm.
Specifically, the training corpus is first assembled from a large amount of text to be processed; it may come from materials and files, or from the Internet. The characterization processing described above is the process of representing the training corpus as a set of corpus features, where a corpus feature represents some characteristic of the training corpus. Corpus features can be of many kinds: every word in the training corpus can serve as a corpus feature, and so can any phrase formed by two adjacent words, or a question, a transition relationship, and the like contained in the training corpus. In the present application, word segmentation of the training corpus is adopted as the characterization processing. Chinese word segmentation is performed on each text in the training corpus to obtain a plurality of corpus segmentation results, where each result may be a single segmented word, a phrase formed by any two adjacent segmented words, or a question, transition relationship, etc. contained in the training corpus. The number of all corpus segmentation results in the entire training corpus therefore equals the number of all corpus features, denoted D.
By obtaining the corpus features through training the corpus in advance, the efficiency of subsequently deriving the feature embedding vector set and the feature topic relationship matrix is improved.
To optimize space performance, in the present application the corpus features (corpus segmentation results) are id-ized so that each corpus feature corresponds to a unique id, and the id is used as the representation of the feature in subsequent processing. The id is usually a hash code generated for each corpus feature, or an auto-increment key starting from 0; either way the field representing the id is an int or a long, a large compression in space compared with the complex feature itself, so space performance is optimized.
Next, the vector dimension n is set, an n-dimensional feature embedding vector is trained for each corpus feature in the training corpus, and the feature embedding vector set is output. A feature embedding vector is a mathematical vector representation of a feature; it can be regarded approximately as carrying all the information of the feature itself, so the training process yields a mathematical vectorization of all features. The parameter n is the dimension into which features are to be converted, and is therefore also the dimension of the subsequent text vectors. In general n should not be too large: considering computation and storage, each of the D features has an array of length n, requiring n*D values in total, so n is tied to space complexity; moreover, beyond a suitable size the embedding vector already suffices to represent the feature, and increasing n helps less and less. The recommended reference value of n is 200. Training the feature embedding vector of each corpus feature may use the Word2vec algorithm proposed by Google, which trains features with a 3-layer neural network and yields the feature embedding vector set required in the present application; the specific algorithm for training the feature embedding vectors is not limited here.
Finally, the number of topics m is set, the m*D relationship matrix between the latent topics of the training corpus and each corpus feature is trained, and the feature topic relationship matrix is output. The parameter m is similar to the vector dimension n above: it relates to space complexity and indicates the number of latent topics in the semantic field. The topics implicit behind the training corpus must be trained out of it. A topic is a hypothetical concept that cannot be explained directly by description but is approximated by a set of features that can describe it. Topic modeling usually faces the problem that the number of topics m is hard to determine: suppose the actual semantic field contains m* topics; m* is difficult to optimize automatically during training, so it must be approximated by m. Approximating the true number of topics amounts to capturing as much of the semantic field's information as possible: if m<m*, some topic information goes unrepresented; if m>m*, all m* topics have their representations while the extra m-m* topics are mis-separated or overlap with the m* topics. For the present application, choosing a relatively large m is therefore the safer strategy that causes no information loss; balancing space complexity, a reference value of 50 to 100 for m suffices. In the present application the LDA algorithm is used to train the relationship between topics and each corpus feature; alternative methods such as PLSA or SVD can also implement this training process.
Step S102: perform characterization processing on the test text to obtain a target text represented by a plurality of text features.
The test text obtained above is characterized. In the present application, Chinese word segmentation serves as the means of characterization: a text is input, and the program performs Chinese word segmentation on it to obtain a sequence of text features, where each text feature may be a single feature word, a phrase formed by any two adjacent feature words, or a question, transition relationship, etc. contained in the test text. The text features here correspond to the corpus features described above.
Optionally, in the text vector representation method provided by the embodiment of the present application, performing characterization processing on the test text to obtain the target text represented by a plurality of text features includes: performing word segmentation processing on the test text to obtain a plurality of text segmentation results.
After word segmentation processing is performed on the test text to obtain a plurality of text segmentation results, id-ization is performed on each text segmentation result to obtain a second data set, and the target text is represented by the second data set.
It should be noted that a text segmentation result may be a single segmented word, a phrase formed by any two adjacent segmented words, or a question, transition relationship, etc. contained in the test text. The purpose of id-izing each text segmentation result is the same as for the corpus segmentation results and is not repeated here.
Optionally, in the text vector representation method provided by the embodiment of the present application, processing the target text with the pre-stored feature topic relationship matrix to obtain the topic distribution of the target text includes: transforming the target text through the pre-stored feature topic relationship matrix according to a preset transformation manner to obtain the topic distribution of the target text, where the algorithm used in the preset transformation manner is the LDA algorithm.
It should be noted that the algorithm used in the preset transformation manner need only be consistent with the algorithm used to train the relationship between each topic and each corpus feature; the specific algorithm is not limited in the present application.
After the segmented words are obtained through the Chinese word segmentation process, they must be matched against the corpus features and ids in the feature topic relationship matrix; the ids matching the same corpus features are taken as return values, and a feature id sequence is output. Note also that the characterization processing steps for the training corpus must be consistent with those for the test text, i.e., the same characterization logic is required, to ensure that the resulting feature sets are comparable.
It should be noted that in the above process some segmented words may not have appeared in the earlier training corpus, i.e., no corresponding feature exists from training; such unseen features can simply be discarded.
步骤S103,利用预存的特征主题关系矩阵处理目标文本,得到目标文本的主题分布,其中,主题分布包括目标文本的目标主题与目标主题对应的比例。
需要在系统内存中持久化存储上述得到的特征主题关系矩阵,在初始化时直接从存储媒介中读入特征主题关系矩阵到内存中即可。另外,也可以将训练特征主题关系矩阵的语料数据输入给系统,让系统预先在线训练特征主题关系矩阵,并将模型直接存储在内存中,后续直接进行在线处理即可。
The feature-topic relationship matrix obtained from the training process can be represented as a matrix of dimension m*D, i.e., each row represents a topic and each column represents a corpus feature, and a value in the matrix represents the degree to which the corpus feature belongs to the topic, called the membership degree in the present application. The meaning and value range of the membership degree differ among the algorithms used to train the feature-topic relationship matrix, but in general the relative ordering of membership degrees is meaningful as a reference: the larger the membership degree, the more the corpus feature belongs to the topic and the better the corpus feature can be used to describe the topic. Therefore, the abstract concept of a topic can be approximately represented by extracting the several corpus features that best describe it.
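For illustration only (the names here are assumptions), extracting the several corpus features that best describe a topic reduces to sorting one row of the m*D matrix by membership degree:

import numpy as np

def describe_topic(topic_feature_matrix, topic, id_to_feature, k=10):
    """Approximate a topic by the k corpus features with the highest
    membership degree in that topic's row; id_to_feature is the inverse
    of the feature_to_id mapping built during training."""
    row = topic_feature_matrix[topic]
    top_ids = np.argsort(row)[::-1][:k]   # ids sorted by descending membership
    return [id_to_feature[i] for i in top_ids]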
In this step, the input target text represented by text features can be transformed through the feature-topic relationship matrix into a target text represented by a topic distribution. Again, the specific transformation depends on the structure of the feature-topic relationship matrix and on the algorithm used to train it. Since the LDA algorithm was used in the training process above, the LDA algorithm is likewise adopted here for the transformation; the details are as follows:

In the initialization step, for every text feature of the target text represented by text features, a random number between 0 and m-1 is generated as that text feature's vote for the topic information it contributes to the original text. It should be noted that this vote is entirely random.

The online process then keeps iterating over the topic vote of every text feature in the text. One full pass over all text features in the text counts as one iteration; the number of iterations is input in advance as a parameter, with a usual reference value of 10 or more, which can be increased or decreased according to the number of topics.

For each topic vote to be updated, which topic to vote for next is mainly influenced by two factors: one is the distribution of the topic votes of all the other text features in the text at that moment, and the other is the distribution of that text feature's membership across the different topics in the feature-topic relationship matrix. For each possible next vote value, the probability of voting for that topic can be obtained by Formula 1 below:
P(topic x | feature k) ∝ n_{-k}(x) · M(x, k)    (Formula 1)

where n_{-k}(x) is the share of votes currently cast for topic x by all the other text features in the text, M(x, k) is the membership degree of feature k under topic x in the feature-topic relationship matrix, and the products are normalized over all m topics so that the probabilities sum to 1.
From Formula 1 above, the probability that the next vote goes to each topic can be calculated; the sampling step of the algorithm then votes for the corresponding topic with exactly these probabilities, specifically as follows:
Suppose there is an array of length m whose values respectively represent P(topic x | feature k). First, the array is traversed and a running sum is accumulated into each subsequent entry, producing a cumulative probability array. For example, if the original array is [0.2, 0.5, 0.3], the result after cumulative summation is [0.2, 0.7 (0.2+0.5), 1 (0.7+0.3)]. Next, a random number is generated and a second traversal is performed, checking whether the random number is smaller than the current array value; if it is smaller, the current traversal position is returned. For example, if the generated random number is 0.8: the 0th value 0.2 is visited, 0.8 > 0.2, so traversal continues; the 1st value 0.7 is visited, 0.8 > 0.7, so traversal continues; the 2nd value 1 is visited, 0.8 < 1, so traversal position 2 is returned. In this way, through the online iterative process, the topic vote of every feature of the text after the iterations terminate can be obtained. Aggregating the vote results of all features yields the topic distribution of the text.
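The cumulative-sum sampling just described can be sketched as follows; this is an illustration of the procedure above, not the authoritative implementation:

import itertools
import random

def sample_topic(probs):
    """Sample a topic index with probabilities probs, where probs[x]
    stands for P(topic x | feature k)."""
    cumulative = list(itertools.accumulate(probs))  # [0.2, 0.5, 0.3] -> [0.2, 0.7, 1.0]
    r = random.random()
    for position, value in enumerate(cumulative):
        if r < value:
            return position
    return len(probs) - 1   # guard against floating-point rounding at the tail

# with r = 0.8 and probs = [0.2, 0.5, 0.3], position 2 is returned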
Step S104: expanding the text features that describe the target topics using the pre-stored feature embedding vector set to obtain target topic feature sets, and obtaining the vectors representing the target topics from the target topic feature sets.

The system needs to persistently store the feature embedding vector set obtained above; at initialization, it is sufficient to read the feature embedding vector set from the storage medium into memory. Alternatively, the corpus data for training the feature embedding vector set can be input to the system, so that the system trains the feature embedding vector set online in advance and stores it directly in memory, after which online processing can proceed directly.

Step S103 produced the topic distribution of the text, but that topic distribution is still generated from the features the text itself possesses. In fact, the description of each topic is also based on individual, independent features, so the concepts of the features used bear no relationship to one another, i.e., they are independent. For example, the current intermediate result might be that 20% of the target text describes music, 30% describes sports, and 50% describes computers, and the high-membership features under each topic are {music: ["folk", "jazz", "classical"], sports: ["football", "basketball", "volleyball"], computer: ["hard disk", "memory", "graphics card"]}, and so on. In reality, however, features should not exist independently of one another: the relationship between "apple" and "Jobs" is certainly closer than the relationship between "apple" and "sports", and the same reasoning applies to other kinds of features. Therefore, if the sports category were described only through "football", "basketball", and "volleyball", much information from other features closely related to sports would be lost; that is, without taking semantics into account, both "badminton" and "monitor", neither of which appears among the topic's describing features, could only return results unrelated to the sports topic, which is clearly not the case. On this basis, the present application takes the semantic relationships of features into account and expands the description of a topic from a simple feature set to a semantics-based one. The concrete implementation is as follows: based on the feature embedding vector model obtained in the training process, the feature embedding vector of every feature seen in training can be obtained; this embedding vector has dimension n and can be regarded, approximately, as representing all of the information of the feature. Geometrically, an embedding vector can be interpreted as a point in n-dimensional space, or as the vector from the origin to that point. Therefore, the distance between different feature embedding vectors can be computed, and since an embedding vector represents a feature, the distance between embedding vectors is exactly the distance between the corresponding features.
In the present application, the feature set describing a topic is taken as the source of the topic's description, and a semantics-based feature expansion is performed on those source features. The expansion logic is to compute the other features whose distance to the source features is close, and to use these other features, together with the source features, to describe the topic. For example, in the example above, computation shows that "badminton" is close enough in distance to "football", "basketball", and "volleyball", so "badminton" also becomes a feature that describes the topic "sports", whereas "monitor" cannot serve as a feature describing "sports".
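A sketch of this semantics-based expansion, assuming embeddings is the D*n matrix of feature embedding vectors and taking cosine similarity as the closeness measure (the similarity threshold is an illustrative assumption; the present application only requires the distance to be close):

import numpy as np

def expand_topic_features(source_ids, embeddings, threshold=0.6):
    """Add every feature whose cosine similarity to some source feature of
    the topic exceeds the threshold, and describe the topic with both."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    expanded = set(source_ids)
    for src in source_ids:
        sims = normed @ normed[src]          # similarity to every feature
        expanded.update(np.flatnonzero(sims > threshold).tolist())
    return expanded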
After the topic descriptions are expanded at the semantic level, a rich feature set describing each topic is obtained; next, the specific topics need to be represented as vectors at the semantic level. Again, the vector describing a topic likewise relies on the feature set describing it. Geometrically, several points (features) in the space cluster together and form a point cluster, and this point cluster jointly describes some topic. A topic corresponds to some region of the space, and a vector is needed to represent this region approximately; one of the best choices is the centroid of the region, i.e., its geometric center. Since the topic region is unknown, it is approximated by the feature point cluster, i.e., by computing the arithmetic mean of all the feature vectors. This yields an approximate vector representation of the topic at the semantic level, and the dimension of this vector equals the dimension n of the feature embedding vectors.
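The centroid computation then reduces to an arithmetic mean over the expanded feature set, for example (continuing the assumed embeddings matrix from the previous sketch):

def topic_vector(expanded_ids, embeddings):
    """Represent a topic by the centroid (arithmetic mean) of the embedding
    vectors of its expanded feature set; the result has dimension n."""
    return embeddings[list(expanded_ids)].mean(axis=0)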
Step S105: performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.

Optionally, in the text vector representation method provided by the embodiment of the present application, performing calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text includes: multiplying the proportion corresponding to each target topic by that target topic's vector respectively; and performing a weighted summation over the multiplied results to obtain the vector representing the test text.

Specifically, this step aggregates the results obtained in step S103 and step S104 into the vector representation of the text. Step S103 yielded the topic distribution of the text, i.e., the distribution of the amount of space the text devotes to each topic. Step S104 yielded the semantic-level vector representation of each topic. To aggregate the two parts, it suffices to perform a weighted summation over the outputs of the two steps, which is realized by Formula 2 below:
v(text) = Σ_x p(x) · v(topic x)    (Formula 2)

where the sum runs over all target topics x, p(x) is the proportion of topic x in the topic distribution of the text, and v(topic x) is the vector representing topic x.
For example: if the topic distribution of the test text is {music: 0.2; sports: 0.3; computer: 0.5} and v(music) = [1, 0, 0], v(sports) = [0, 1, 0], v(computer) = [0, 0, 1], then the aggregated vector representation of the test text is [0.2, 0.3, 0.5].
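Formula 2 and this example can be reproduced with a short sketch (the dictionary layout is an illustrative assumption):

import numpy as np

def text_vector(topic_distribution, topic_vectors):
    """Formula 2: weighted sum of the topic vectors by the topic proportions."""
    return sum(p * np.asarray(topic_vectors[t])
               for t, p in topic_distribution.items())

distribution = {"music": 0.2, "sports": 0.3, "computer": 0.5}
vectors = {"music": [1, 0, 0], "sports": [0, 1, 0], "computer": [0, 0, 1]}
print(text_vector(distribution, vectors))   # -> [0.2 0.3 0.5]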
The text vector representation method provided by the embodiment of the present application is thus a text vector representation method based on text topics and semantics. This method greatly compensates for the shortcoming of the text vector representation methods in the related art, namely their insufficient ability to interpret the latent semantics behind the literal text; it can represent the information behind the text and has better representation ability. Meanwhile, in online processing, the text vector representation method provided by the embodiment of the present application supports inputting a single piece of text and returning the vector representation of that text, without relying on batch input of a corpus composed of multiple texts; this support for streaming data input better fits practical functional requirements. Text vectors obtained by the text vector representation method provided by the embodiment of the present application achieve higher evaluation results when applied to subsequent natural language processing tasks such as text classification and similar-article computation, and greatly improve the degree to which computers understand text semantics.

The text vector representation method provided by the embodiment of the present application acquires a test text; performs featurization processing on the test text to obtain a target text represented by a plurality of text features; processes the target text using the pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics; expands the text features describing the target topics using the pre-stored feature embedding vector set to obtain target topic feature sets, and obtains the vectors representing the target topics from the target topic feature sets; and performs calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text. This solves the problem that the text vector representation methods in the related art have weak ability to express the semantic information contained in the text. By introducing topics and text features (semantics), the vectorized text has greater representation ability, which compensates for the insufficient ability of related-art text vector representation methods to interpret the latent semantics behind the literal text, greatly improves the degree to which computers understand text semantics, and thereby achieves the effect of improving the ability to express the semantic information contained in the text.
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one here.

An embodiment of the present application further provides a text vector representation apparatus. It should be noted that the text vector representation apparatus of the embodiment of the present application can be used to execute the text vector representation method provided by the embodiment of the present application. The text vector representation apparatus provided by the embodiment of the present application is introduced below.

FIG. 2 is a schematic diagram of a text vector representation apparatus according to an embodiment of the present application. As shown in FIG. 2, the apparatus includes: a first acquisition unit 10, a first processing unit 20, a second processing unit 30, an expansion unit 40, and a calculation unit 50.
The first acquisition unit 10 is configured to acquire a test text.

The first processing unit 20 is configured to perform featurization processing on the test text to obtain a target text represented by a plurality of text features.

The second processing unit 30 is configured to process the target text using the pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics.

The expansion unit 40 is configured to expand the text features describing the target topics using the pre-stored feature embedding vector set to obtain target topic feature sets, and to obtain the vectors representing the target topics from the target topic feature sets.

The calculation unit 50 is configured to perform calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
It should be noted here that the above first acquisition unit 10, first processing unit 20, second processing unit 30, expansion unit 40, and calculation unit 50 may run in a computer terminal as part of the apparatus, and the functions implemented by the above modules may be executed by a processor in the computer terminal; the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD.

In the text vector representation apparatus provided by the embodiment of the present application, the first acquisition unit 10 acquires a test text; the first processing unit 20 performs featurization processing on the test text to obtain a target text represented by a plurality of text features; the second processing unit 30 processes the target text using the pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics; the expansion unit 40 expands the text features describing the target topics using the pre-stored feature embedding vector set to obtain target topic feature sets, and obtains the vectors representing the target topics from the target topic feature sets; and the calculation unit 50 performs calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text. This solves the problem that the text vector representation methods in the related art have weak ability to express the semantic information contained in the text. By introducing topics and text features (semantics), the vectorized text has greater representation ability, which compensates for the insufficient ability of related-art text vector representation methods to interpret the latent semantics behind the literal text, greatly improves the degree to which computers understand text semantics, and thereby achieves the effect of improving the ability to express the semantic information contained in the text.
Optionally, in the text vector representation apparatus provided by the embodiment of the present application, the apparatus further includes: a second acquisition unit configured to acquire a training corpus, where the training corpus is a corpus used for training; a third processing unit configured to perform featurization processing on the training corpus to obtain a plurality of corpus features; a first training unit configured to train a feature embedding vector of each corpus feature separately to obtain a feature embedding vector set; a third acquisition unit configured to acquire a plurality of topics in the training corpus; a second training unit configured to train the relationship between each topic and each corpus feature separately to obtain a feature-topic relationship matrix; and a storage unit configured to store the feature embedding vector set and the feature-topic relationship matrix.

Optionally, in the text vector representation apparatus provided by the embodiment of the present application, the featurization processing includes word segmentation; the third processing unit is further configured to perform word segmentation on the training corpus to obtain a plurality of corpus segmentation results, and the first processing unit is further configured to perform word segmentation on the test text to obtain a plurality of text segmentation results.

Optionally, in the text vector representation apparatus provided by the embodiment of the present application, the apparatus further includes: a fourth processing unit configured to, after word segmentation is performed on the training corpus to obtain the plurality of corpus segmentation results, id-ize each corpus segmentation result separately to obtain a first, id-ized data set, where id-ization means assigning one id to each corpus segmentation result, and to represent the plurality of corpus features by the first data set; and a fifth processing unit configured to, after word segmentation is performed on the test text to obtain the plurality of text segmentation results, id-ize each text segmentation result separately to obtain a second, id-ized data set, and to represent the target text by the second data set.
Optionally, in the text vector representation apparatus provided by the embodiment of the present application, training the feature embedding vector of each corpus feature separately means training the feature embedding vector of each corpus feature using the Word2vec algorithm.

Optionally, in the text vector representation apparatus provided by the embodiment of the present application, training the relationship between each topic and each corpus feature separately means training the relationship between each topic and each corpus feature using the LDA algorithm.

Optionally, in the text vector representation apparatus provided by the embodiment of the present application, the second processing unit 30 is further configured to transform the target text through the pre-stored feature-topic relationship matrix in a preset transformation manner to obtain the topic distribution of the target text, where the algorithm adopted in the preset transformation manner is the LDA algorithm.
The functional units provided by the embodiments of the present application may run in a mobile terminal, a computer terminal, or a similar computing apparatus, or may be stored as part of a storage medium.

Thus, an embodiment of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the above computer terminal may also be replaced by a terminal device such as a mobile terminal.

Optionally, in this embodiment, the above computer terminal may be located in at least one of a plurality of network devices of a computer network.

In this embodiment, the above computer terminal may execute the program code for the following steps of the text vector representation method: acquiring a test text; performing featurization processing on the test text to obtain a target text represented by a plurality of text features; processing the target text using a pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics; expanding the text features describing the target topics using a pre-stored feature embedding vector set to obtain target topic feature sets, and obtaining the vectors representing the target topics from the target topic feature sets; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
Optionally, the computer terminal may include: one or more processors, a memory, and a transmission apparatus.

The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the text vector representation method and apparatus in the embodiments of the present invention; the processor executes various functional applications and data processing, i.e., implements the above text vector representation method, by running the software programs and modules stored in the memory. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, and such remote memory may be connected to the terminal through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The above transmission apparatus is used to receive or send data via a network. Specific examples of the above network may include wired networks and wireless networks. In one example, the transmission apparatus includes a Network Interface Controller (NIC), which can be connected to other network devices and a router through a network cable so as to communicate with the Internet or a local area network. In one example, the transmission apparatus is a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
Specifically, the memory is used to store information on preset action conditions and preset authorized users, as well as application programs.

The processor may call the information and application programs stored in the memory via the transmission apparatus to execute the program code of the method steps of each optional or preferred embodiment in the above method embodiments.

Those of ordinary skill in the art can understand that the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the hardware related to the terminal device, and that the program may be stored in a computer-readable storage medium; the storage medium may include: a flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.

An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the above storage medium may be used to store the program code executed by the text vector representation method provided by the above method embodiment and apparatus embodiment.

Optionally, in this embodiment, the above storage medium may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: acquiring a test text; performing featurization processing on the test text to obtain a target text represented by a plurality of text features; processing the target text using a pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics; expanding the text features describing the target topics using a pre-stored feature embedding vector set to obtain target topic feature sets, and obtaining the vectors representing the target topics from the target topic feature sets; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.

Optionally, in this embodiment, the storage medium may also be configured to store the program code of the various preferred or optional method steps provided by the text vector representation method.
The text vector representation method and apparatus according to the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can also be made to the text vector representation method and apparatus proposed by the present invention above without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.

The text vector representation apparatus includes a processor and a memory; the above first acquisition unit, first processing unit, second processing unit, expansion unit, calculation unit, and the like are all stored in the memory as program units, and the processor executes the above program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the text vector is represented by adjusting kernel parameters.

The memory may include forms of computer-readable media such as non-persistent memory, random access memory (RAM), and/or non-volatile memory, e.g., read-only memory (ROM) or flash RAM; the memory includes at least one memory chip.

The present application further provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring a test text; performing featurization processing on the test text to obtain a target text represented by a plurality of text features; processing the target text using a pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text, where the topic distribution includes the target topics of the target text and the proportions corresponding to the target topics; expanding the text features describing the target topics using a pre-stored feature embedding vector set to obtain target topic feature sets, and obtaining the vectors representing the target topics from the target topic feature sets; and performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in a certain embodiment, reference may be made to the relevant descriptions of the other embodiments.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is merely a logical function division, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Obviously, those skilled in the art should understand that the above modules or steps of the present application can be implemented with a general-purpose computing apparatus; they can be concentrated on a single computing apparatus or distributed over a network composed of multiple computing apparatuses. Optionally, they can be implemented with program code executable by computing apparatuses, so that they can be stored in a storage apparatus and executed by the computing apparatuses, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module for implementation. In this way, the present application is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present application and are not intended to limit the present application; for those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the scope of protection of the present application.

Claims (15)

  1. A text vector representation method, characterized by comprising:
    acquiring a test text;
    performing featurization processing on the test text to obtain a target text represented by a plurality of text features;
    processing the target text using a pre-stored feature-topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution comprises target topics of the target text and proportions corresponding to the target topics;
    expanding text features describing the target topics using a pre-stored feature embedding vector set to obtain target topic feature sets, and obtaining vectors representing the target topics from the target topic feature sets; and
    performing calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
  2. The method according to claim 1, characterized in that, before acquiring the test text, the method further comprises:
    acquiring a training corpus, wherein the training corpus is a corpus used for training;
    performing featurization processing on the training corpus to obtain a plurality of corpus features;
    training a feature embedding vector of each corpus feature separately to obtain a feature embedding vector set;
    acquiring a plurality of topics in the training corpus;
    training a relationship between each topic and each corpus feature separately to obtain a feature-topic relationship matrix; and
    storing the feature embedding vector set and the feature-topic relationship matrix.
  3. The method according to claim 2, characterized in that
    the featurization processing comprises word segmentation, and performing featurization processing on the training corpus to obtain the plurality of corpus features comprises: performing word segmentation on the training corpus to obtain a plurality of corpus segmentation results, and
    performing featurization processing on the test text to obtain the target text represented by the plurality of text features comprises: performing word segmentation on the test text to obtain a plurality of text segmentation results.
  4. The method according to claim 3, characterized in that,
    after word segmentation is performed on the training corpus to obtain the plurality of corpus segmentation results, the method further comprises: id-izing each corpus segmentation result separately to obtain a first, id-ized data set, wherein the id-ization means assigning one id to each corpus segmentation result; and representing the plurality of corpus features by the first data set, and
    after word segmentation is performed on the test text to obtain the plurality of text segmentation results, the method further comprises: id-izing each text segmentation result separately to obtain a second, id-ized data set; and representing the target text by the second data set.
  5. The method according to claim 2, characterized in that training the feature embedding vector of each corpus feature separately is training the feature embedding vector of each corpus feature using a Word2vec algorithm.
  6. The method according to claim 2, characterized in that training the relationship between each topic and each corpus feature separately is training the relationship between each topic and each corpus feature using an LDA algorithm.
  7. The method according to claim 6, characterized in that processing the target text using the pre-stored feature-topic relationship matrix to obtain the topic distribution of the target text comprises:
    transforming the target text through the pre-stored feature-topic relationship matrix in a preset transformation manner to obtain the topic distribution of the target text, wherein the algorithm adopted in the preset transformation manner is the LDA algorithm.
  8. The method according to claim 1, characterized in that performing calculation processing on the topic distribution and the vectors representing the target topics to obtain the vector representing the test text comprises:
    multiplying the proportion corresponding to each target topic by the vector of that target topic respectively; and
    performing a weighted summation over the multiplied results to obtain the vector representing the test text.
  9. A text vector representation apparatus, characterized by comprising:
    a first acquisition unit configured to acquire a test text;
    a first processing unit configured to perform featurization processing on the test text to obtain a target text represented by a plurality of text features;
    a second processing unit configured to process the target text using a pre-stored feature-topic relationship matrix to obtain a topic distribution of the target text, wherein the topic distribution comprises target topics of the target text and proportions corresponding to the target topics;
    an expansion unit configured to expand text features describing the target topics using a pre-stored feature embedding vector set to obtain target topic feature sets, and to obtain vectors representing the target topics from the target topic feature sets; and
    a calculation unit configured to perform calculation processing on the topic distribution and the vectors representing the target topics to obtain a vector representing the test text.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    a second acquisition unit configured to acquire a training corpus, wherein the training corpus is a corpus used for training;
    a third processing unit configured to perform featurization processing on the training corpus to obtain a plurality of corpus features;
    a first training unit configured to train a feature embedding vector of each corpus feature separately to obtain a feature embedding vector set;
    a third acquisition unit configured to acquire a plurality of topics in the training corpus;
    a second training unit configured to train a relationship between each topic and each corpus feature separately to obtain a feature-topic relationship matrix; and
    a storage unit configured to store the feature embedding vector set and the feature-topic relationship matrix.
  11. The apparatus according to claim 10, characterized in that
    the featurization processing comprises word segmentation, the third processing unit is further configured to perform word segmentation on the training corpus to obtain a plurality of corpus segmentation results, and
    the first processing unit is further configured to perform word segmentation on the test text to obtain a plurality of text segmentation results.
  12. The apparatus according to claim 11, characterized in that
    the apparatus further comprises: a fourth processing unit configured to, after word segmentation is performed on the training corpus to obtain the plurality of corpus segmentation results, id-ize each corpus segmentation result separately to obtain a first, id-ized data set, wherein the id-ization means assigning one id to each corpus segmentation result, and to represent the plurality of corpus features by the first data set, and
    the apparatus further comprises: a fifth processing unit configured to, after word segmentation is performed on the test text to obtain the plurality of text segmentation results, id-ize each text segmentation result separately to obtain a second, id-ized data set, and to represent the target text by the second data set.
  13. The apparatus according to claim 10, characterized in that training the feature embedding vector of each corpus feature separately is training the feature embedding vector of each corpus feature using a Word2vec algorithm.
  14. The apparatus according to claim 10, characterized in that training the relationship between each topic and each corpus feature separately is training the relationship between each topic and each corpus feature using an LDA algorithm.
  15. The apparatus according to claim 14, characterized in that the second processing unit is further configured to transform the target text through the pre-stored feature-topic relationship matrix in a preset transformation manner to obtain the topic distribution of the target text, wherein the algorithm adopted in the preset transformation manner is the LDA algorithm.