CN112883715A - Word vector construction method and device - Google Patents

Word vector construction method and device

Info

Publication number
CN112883715A
CN112883715A (application number CN201911197725.3A)
Authority
CN
China
Prior art keywords
vocabulary
word
concept
importance
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911197725.3A
Other languages
Chinese (zh)
Other versions
CN112883715B (en)
Inventor
刘垚
邹更
任钰欣
黄梓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Yujianwan Technology Co ltd
Original Assignee
Wuhan Yujianwan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Yujianwan Technology Co ltd filed Critical Wuhan Yujianwan Technology Co ltd
Priority to CN201911197725.3A priority Critical patent/CN112883715B/en
Publication of CN112883715A publication Critical patent/CN112883715A/en
Application granted granted Critical
Publication of CN112883715B publication Critical patent/CN112883715B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 - Thesaurus
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for constructing a word vector. The method comprises: preprocessing each text in a text set of m independent texts to obtain all sentences and vocabulary of the corpus formed by the m texts; taking each word of the corpus in turn as a concept topic word, traversing all sentences and words, and placing every word that appears in the same sentence as the concept topic word into the vocabulary set corresponding to that concept topic word; screening the vocabulary elements in each vocabulary set; determining the importance of each vocabulary element from its inverse text frequency index in the filtered vocabulary set X_I and its co-occurrence with the concept topic word; and constructing the word vector of the filtered vocabulary set X_I from the importance of each vocabulary element and the initial word vectors. A word vector constructed by the method can fully express the relation between a word and the global text.

Description

Word vector construction method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for constructing word vectors.
Background
Currently, in the field of natural language processing, word vectors are a common feature representation method for linguistic symbols. Common word embedding methods for constructing word vectors mainly include word2vec, GloVe and the like.
In the process of implementing the invention, the inventors found that the existing methods have at least the following technical problems:
the word2vec method depends only on local information, namely the relations between neighboring words, and does not use the overall information of the text; GloVe incorporates global vocabulary statistics alongside local information. Neither method, however, comprehensively expresses the relation between a word and the global text, so the constructed word vectors are not informative enough.
Therefore, the methods in the prior art suffer from the technical problem that the word vector information is not comprehensive enough.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for constructing a word vector, so as to solve or at least partially solve the technical problem that the expression of word vector information constructed by the method in the prior art is not comprehensive.
In order to solve the above technical problem, a first aspect of the present invention provides a method for constructing a word vector, including:
preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer;
taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
screening vocabulary elements in a vocabulary set;
determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
In one embodiment, screening vocabulary elements in a vocabulary set includes:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
In one embodiment, determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word includes:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of a vocabulary element with the concept topic word by the ratio of z_j, the number of texts in which the vocabulary element x_j and the concept topic word x_i co-occur, to the sum of the numbers of texts in which all vocabulary elements in X_I co-occur with x_i;
and determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
In one embodiment, calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I includes:
calculating the inverse text frequency index of each vocabulary element according to the total number of vocabulary sets and the number of vocabulary sets containing the vocabulary element x_j, by the formula:

IDF_j = log10(count_X / count_xj)

wherein count_X denotes the total number of vocabulary sets, count_xj denotes the number of vocabulary sets containing the vocabulary element x_j, and IDF_j denotes the inverse text frequency index of x_j.
In one embodiment, determining the importance of each vocabulary element under the same concept topic word based on the inverse text frequency index and the calculated ratio of each vocabulary element comprises:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
In one embodiment, constructing the word vector of the filtered vocabulary set X_I based on the importance of each vocabulary element and the initial word vector comprises:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
In one embodiment, after constructing the word vectors for the vocabulary set, the method further comprises:
according to each vocabulary set XIAnd (4) clustering the vectors of all vocabulary sets by adopting a k-means algorithm.
Based on the same inventive concept, a second aspect of the present invention provides a word vector constructing apparatus, including:
the preprocessing module is used for preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of the corpus formed by the m independent texts, wherein m is a positive integer;
the vocabulary set building module is used for traversing all sentences and vocabularies by taking each vocabulary of the corpus as a concept subject word, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module, configured to determine the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
a word vector construction module, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device having a computer program stored thereon, which when executed performs the method of the first aspect.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a word vector construction method, which comprises the steps of firstly preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus consisting of the m independent texts; then, each vocabulary of the corpus is used as a concept subject word, all sentences and vocabularies are traversed, and the vocabularies which commonly appear in the same sentence with the concept subject word are brought into a vocabulary set corresponding to the concept subject word; then, the words in the vocabulary set are alignedScreening the vocabulary elements; then according to the filtered vocabulary set XIDetermining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the co-occurrence condition of each vocabulary element and the concept topic word; then according to the importance of each vocabulary element and the initial word vector, constructing a screened vocabulary set XIThe word vector of (a).
In the method provided by the invention, on the basis of constructing word vectors from local text information with a preset word embedding method, the relation among the words of the global text is represented by the set of words that frequently co-occur with the target word across the global text data, and this relational information is used to correct the word vector of the word. The resulting new word vector makes full use of the relation between the target word and the global text vocabulary, so the constructed word vector carries more comprehensive and richer information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a word vector construction method according to the present invention;
FIG. 2 is a diagram illustrating clustering results obtained in an exemplary embodiment;
FIG. 3 is a block diagram of an apparatus for constructing word vectors according to an embodiment of the present invention;
FIG. 4 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 5 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a method and a device for constructing word vectors aiming at the technical problem that the expression of word vector information constructed by the method in the prior art is incomplete, so that the aim of improving the comprehensiveness of the word vector information is fulfilled.
In order to achieve the above object, the main concept of the present invention is as follows:
according to the co-occurrence rule in the text, a symbol set (vocabulary set) taking a single language symbol (concept subject word) as a center is constructed, meanwhile, a word embedding method is used for carrying out vector feature representation on each language symbol, and the vector of the vocabulary in the symbol set is used for correcting the center language symbol according to weight pairs. On the basis that word vectors are constructed by word2vec through local text information, the relation between the global text vocabularies is represented by combining a set formed by vocabularies which are frequently co-occurring with the target vocabularies in the global text data, and the word vectors of the vocabularies are corrected by utilizing the relation information to obtain new word vectors, so that the information of the new word vectors is richer, and the relation between the target vocabularies and the global text vocabularies can be more fully represented.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The present embodiment provides a method for constructing a word vector, please refer to fig. 1, the method includes:
step S1: preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer.
Specifically, the number of m may be determined according to actual conditions. The preprocessing comprises the steps of sentence segmentation, word segmentation, stop word removal and the like.
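A minimal sketch of the preprocessing in step S1 might look as follows; the regex-based sentence splitter, tokenizer, and stop-word list are assumptions for illustration (the patent does not fix an implementation, and Chinese text would need a word segmenter rather than `\w+` matching).

```python
import re

# Illustrative stop-word list; a real pipeline would use a full list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def preprocess(texts):
    """Split each text into sentences, tokenize, and drop stop words.
    Returns one list per text, each a list of token lists (sentences)."""
    corpus = []
    for text in texts:
        sentences = []
        for raw in re.split(r"[.!?。！？]+", text):
            tokens = [w for w in re.findall(r"\w+", raw.lower())
                      if w not in STOP_WORDS]
            if tokens:
                sentences.append(tokens)
        corpus.append(sentences)
    return corpus

docs = ["The stock market rose. Investors bought shares.",
        "Shares of the fund fell."]
corpus = preprocess(docs)
```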
Step S2: and taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements.
Specifically, all words of the corpus are obtained in step S1. In this step, for each word, the words appearing in the same sentence as that word are found, and a vocabulary set is constructed from them; a word appearing together with the concept topic word indicates that the two words are associated. The vocabulary set therefore contains two kinds of members: the concept topic word itself and its vocabulary elements.
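The set construction of step S2 can be sketched as below, assuming the corpus has already been preprocessed (step S1) into per-text lists of tokenized sentences; every word serves in turn as a concept topic word, and any word sharing a sentence with it joins that word's vocabulary set.

```python
from collections import defaultdict

def build_vocab_sets(corpus):
    """Map each concept topic word to the set of words that
    co-occur with it in at least one sentence."""
    vocab_sets = defaultdict(set)
    for sentences in corpus:
        for sentence in sentences:
            unique = set(sentence)
            for topic in unique:
                vocab_sets[topic] |= unique - {topic}
    return dict(vocab_sets)

corpus = [[["stock", "market", "rose"], ["investors", "bought", "shares"]],
          [["shares", "fund", "fell"]]]
vocab_sets = build_vocab_sets(corpus)
```

Note the topic word itself is excluded from its own set, matching the description that a vocabulary set consists of the concept topic word plus its vocabulary elements.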
Step S3: and screening the vocabulary elements in the vocabulary set.
Specifically, to improve the accuracy of the vocabulary set, this step further screens the vocabulary elements, for example according to the number of texts in which a vocabulary element and the concept topic word co-occur. Whether to retain a vocabulary element can be decided by judging whether it frequently appears together with the concept topic word in the same text, where "frequently" is defined by a set threshold.
In one embodiment, step S3 includes:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
Specifically, the first threshold may be set according to actual conditions, and may be, for example, 3, 5, 6, or the like. Through the screening of the vocabulary elements, the vocabulary which often appears in the same text with the concept subject word can be selected and used as the effective vocabulary, so that the accuracy of the vocabulary set is improved.
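The screening of step S3 can be sketched as follows; representing each text by its word set (`doc_words`) and the helper name `screen` are illustrative assumptions, and the returned counts z_j are kept for the later weight calculation.

```python
def screen(topic, elements, doc_words, threshold=3):
    """Keep element x_j only if it co-occurs with the concept topic
    word in at least `threshold` distinct texts; return {x_j: z_j}."""
    kept = {}
    for x_j in elements:
        z = sum(1 for words in doc_words
                if topic in words and x_j in words)
        if z >= threshold:
            kept[x_j] = z
    return kept

doc_words = [{"stock", "market"}, {"stock", "market", "fund"},
             {"stock", "market"}, {"stock", "fund"}]
kept = screen("stock", {"market", "fund"}, doc_words, threshold=3)
```

Here "market" co-occurs with "stock" in three texts and survives, while "fund" (two texts) is removed.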
Step S4: determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word.
Specifically, the basic idea of IDF is that the fewer vocabulary sets contain a given vocabulary element, the larger its IDF value, indicating that the element has better distinguishing power among the vocabulary sets. The co-occurrence of a vocabulary element with the concept topic word may be measured by the number of texts in which they co-occur, or by the ratio of that number to the total number of co-occurrences of all vocabulary elements with the concept topic word, and so on.
In one embodiment, step S4 includes:
step S4.1: calculating the filtered vocabulary set XIThe inverse text frequency index of each vocabulary element;
step S4.2: using the vocabulary element xjAnd concept topic word xiNumber of co-occurring texts zjAnd vocabulary set XIAll vocabulary elements and concept topic words x iniThe ratio of the sum of the number of the commonly occurring texts represents the co-occurrence condition of the vocabulary elements and the concept subject words;
step S4.3: and determining the importance of each vocabulary element under the same concept subject word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
Specifically, for all the vocabulary elements {x_j, x_k, …, x_n} in the vocabulary set X_I, the importance of each vocabulary element can be represented by the frequency P_j of x_j:

P_j = z_j / Σ_k z_k

that is, the number of texts z_j in which the vocabulary element x_j and the concept topic word x_i co-occur, divided by the sum, over all vocabulary elements in X_I, of the numbers of texts in which they co-occur with x_i.
in the embodiment, the importance of each vocabulary element under the same concept topic is further determined according to the inverse text frequency index based on the ratio of the number of times of the co-occurrence of the vocabulary elements and the concept topic and the number of times of the co-occurrence of all the vocabulary elements and the concept topic.
In one embodiment, step S4.1 comprises:
according to the total number of vocabulary sets and the included vocabulary element xjThe inverse text frequency index of each vocabulary element is calculated by the vocabulary set, and the calculation formula is as follows:
Figure BDA0002295086750000071
wherein, countXIndicates the total number of vocabulary sets, countxjThe representation contains a lexical element xjNumber of vocabulary sets, IDFjRepresenting a lexical element xjThe inverse text frequency index of (c).
Specifically, the vocabulary element x is paired withjThe inverse text frequency index of which is divided by the total number of vocabulary sets, by the inclusion of the vocabulary element xjThe number of the vocabulary sets obtained is then taken as the logarithm of the base 10,
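As an illustrative sketch, the IDF computation follows directly from the formula above; the `vocab_sets` dict mapping each concept topic word to its set of vocabulary elements is an assumed representation, not one fixed by the patent.

```python
import math

def inverse_text_frequency(element, vocab_sets):
    """IDF_j = log10(count_X / count_xj): total number of vocabulary
    sets over the number of sets containing the element."""
    count_X = len(vocab_sets)
    count_xj = sum(1 for members in vocab_sets.values() if element in members)
    return math.log10(count_X / count_xj)

vocab_sets = {"stock": {"market", "fund"},
              "bond": {"market"},
              "crop": {"soil"}}
idf_market = inverse_text_frequency("market", vocab_sets)  # log10(3/2)
```

An element appearing in every vocabulary set would get IDF = log10(1) = 0, i.e. no distinguishing power, matching the intuition stated above.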
in one embodiment, step S4.3 comprises:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
Specifically, for each vocabulary element x_j in the vocabulary set X_I, its weight coefficient T_xj is calculated as:

T_xj = IDF_j * P_j
step S5: constructing a screened vocabulary set X according to the importance of each vocabulary element and the initial word vectorIThe initial word vector is obtained by a preset word embedding mode.
In one embodiment, step S5 includes:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
Specifically, the word vector V_I of the vocabulary set X_I is obtained as:

V_I = Σ_j T_xj * v_j

that is, the weight coefficient T_xj of each vocabulary element is multiplied by the element's initial word vector v_j, and the products are summed.
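Putting steps S4 and S5 together, the weighting and summation can be sketched as follows, under the assumption that the co-occurrence counts z_j, the IDF values, and the initial embeddings are given as plain dicts keyed by vocabulary element (the function name `set_vector` is illustrative):

```python
def set_vector(z, idf, vectors):
    """V_I = sum_j T_xj * v_j, with T_xj = IDF_j * P_j
    and P_j = z_j / sum_k z_k."""
    total = sum(z.values())
    dim = len(next(iter(vectors.values())))
    v_I = [0.0] * dim
    for x_j, z_j in z.items():
        t = idf[x_j] * (z_j / total)   # weight coefficient T_xj
        for d in range(dim):
            v_I[d] += t * vectors[x_j][d]
    return v_I

# Tiny worked example: two elements with toy 2-d initial embeddings.
z = {"market": 3, "fund": 1}
idf = {"market": 0.5, "fund": 1.0}
vectors = {"market": [1.0, 0.0], "fund": [0.0, 2.0]}
v = set_vector(z, idf, vectors)  # [0.5*0.75*1.0, 1.0*0.25*2.0]
```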
In one embodiment, after constructing the word vectors for the vocabulary set, the method further comprises:
according to each vocabulary set XIAnd (4) clustering the vectors of all vocabulary sets by adopting a k-means algorithm.
Specifically, the word vectors of all the vocabulary sets are subjected to k-means clustering, the vocabulary sets are divided into k types, and the category of each vocabulary set is obtained.
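For the clustering step, a compact self-contained k-means sketch is shown below; a real pipeline would more likely call a library implementation such as scikit-learn's KMeans, and the toy 2-d points are illustrative only.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: random initial centers, then alternate
    assignment and mean updates for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    labels = [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(p, centers[c]))) for p in points]
    return centers, labels

# Two well-separated pairs of set vectors should land in two classes.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
centers, labels = kmeans(points, k=2)
```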
In order to more clearly illustrate the specific implementation of the method of the present invention, the following is presented by way of specific examples:
1. Example of the vocabulary set X_stock
The concept topic word is "stock", which appears in 51 articles of the corpus. The vocabulary elements in the following table are an excerpt of the effective vocabulary under the concept topic word "stock".
[Table of effective vocabulary elements not reproduced in this extraction.]
2. Example of vocabulary element weight calculation in the vocabulary set X_stock
[Weight calculation table not reproduced in this extraction.]
3. Computing the vector V_stock of the vocabulary set X_stock = {x_j, x_k, …}
[Worked calculation not reproduced in this extraction.]
4. Clustering all vocabulary sets
In this example, ten thousand vocabulary sets were clustered, resulting in 13 classes. Some of the results are as follows:
1 ["data organization", "operation", "image smoothing", "performance evaluation", "inner pixel", "coincidence", …]
2 ["transformation of industrial structure", "agricultural production", "transformation period", "regional economy", "market restriction", …]
3 ["documentary", "structural metaphor", "poetry", "writing order", "literary value", "aesthetic personality", …]
4 ["glassy carbon electrode", "centrifugal dehydration", "film stretching", "hexadecyl trimethyl ammonium bromide", "polyurethane", …]
5 ["radix scrophulariae", "throat", "main treatment", "sleep disorder", "cerebral hemorrhage", "knee osteoarthritis", …]
6 ["use seedling", "bud", "pot experiment", "survival rate", "water quality safety", "planting", "plant community", …]
Please refer to fig. 2, which visualizes the clustering result: the vectors of the vocabulary sets are reduced from high dimension to two dimensions and plotted so that the clustering effect can be observed. In fig. 2, the horizontal and vertical axes are the two reduced dimensions, each point represents a vocabulary set, the color of a point indicates its cluster, and a point's coordinates are its two-dimensional values after dimension reduction.
Example two
Based on the same inventive concept, the present embodiment provides a word vector constructing apparatus, please refer to fig. 3, including:
a preprocessing module 201, configured to preprocess each text included in a text set including m independent texts, to obtain all sentences and vocabularies of a corpus configured by the m independent texts, where m is a positive integer;
the vocabulary set constructing module 202 is used for respectively taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module 203 is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module 204, configured to determine the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
a word vector construction module 205, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
In one embodiment, the vocabulary element filtering module is specifically configured to:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
In one embodiment, the vocabulary element importance determination module is specifically configured to:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of a vocabulary element with the concept topic word by the ratio of z_j, the number of texts in which the vocabulary element x_j and the concept topic word x_i co-occur, to the sum of the numbers of texts in which all vocabulary elements in X_I co-occur with x_i;
and determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
In one embodiment, the vocabulary element importance determination module is further configured to:
according to the total number of vocabulary sets and the included vocabulary element xjThe inverse text frequency index of each vocabulary element is calculated by the vocabulary set, and the calculation formula is as follows:
Figure BDA0002295086750000111
wherein, countXIndicates the total number of vocabulary sets, countxjThe representation contains a lexical element xjNumber of vocabulary sets, IDFjRepresenting a lexical element xjThe inverse text frequency index of (c).
In one embodiment, the vocabulary element importance determination module is further configured to:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
In one embodiment, the word vector construction module is further configured to:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
In one embodiment, the apparatus further comprises a clustering module configured to:
according to each vocabulary set XIThe vectors of all vocabulary sets are processed by adopting a k-means algorithmAnd (6) clustering.
Since the apparatus described in the second embodiment of the present invention is an apparatus used for implementing the method for constructing a word vector in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the apparatus based on the method described in the first embodiment of the present invention, and thus the details are not described herein again. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
Embodiment Three
Referring to fig. 4, based on the same inventive concept, the present application further provides a computer-readable storage medium 300 on which a computer program 311 is stored; when executed, the program implements the method of the first embodiment.
Since the computer-readable storage medium described in the third embodiment of the present invention is used to implement the word vector construction method of the first embodiment, a person skilled in the art can understand its specific structure and variations based on the method described in the first embodiment, so the details are not repeated here. Any computer-readable storage medium used in the method of the first embodiment of the present invention falls within the protection scope of the present invention.
Embodiment Four
Based on the same inventive concept, the present application further provides a computer device; referring to fig. 5, the device comprises a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; when the processor 402 executes the program, the method of the first embodiment is implemented.
Since the computer device introduced in the fourth embodiment of the present invention is used to implement the word vector construction method of the first embodiment, a person skilled in the art can understand its specific structure and variations based on the method introduced in the first embodiment, so the description is not repeated here. All computer devices used in the method of the first embodiment of the present invention fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method for constructing a word vector, comprising:
preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer;
taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
filtering vocabulary elements in the vocabulary set;
determining, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word;
constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vector is obtained by a preset word embedding method.
2. The method of claim 1, wherein filtering vocabulary elements in a vocabulary set comprises:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z is less than or equal to m;
judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as an effective vocabulary of the set; otherwise, removing the vocabulary element from the vocabulary set.
3. The method of claim 1, wherein determining, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word comprises:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of each vocabulary element with the concept topic word by the ratio of the number z_j of texts in which the vocabulary element x_j and the concept topic word x_i appear together to the sum of the numbers of texts in which all vocabulary elements in the vocabulary set X_I appear together with the concept topic word x_i;
and determining the importance of each vocabulary element under the same concept subject word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
4. The method of claim 3, wherein calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I comprises:
calculating the inverse text frequency index of each vocabulary element according to the total number of vocabulary sets and the number of vocabulary sets containing the vocabulary element x_j, with the calculation formula:
IDF_j = log(count_X / count_{x_j})
wherein count_X indicates the total number of vocabulary sets, count_{x_j} indicates the number of vocabulary sets containing the vocabulary element x_j, and IDF_j represents the inverse text frequency index of the vocabulary element x_j.
5. The method of claim 3, wherein determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio comprises:
multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient represents the importance of each vocabulary element under the same concept topic word.
6. The method of claim 5, wherein constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector comprises:
multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to that vocabulary element, and summing the products to obtain the word vector of the vocabulary set.
7. The method of claim 1, wherein after constructing word vectors for a vocabulary set, the method further comprises:
clustering the word vectors of all vocabulary sets X_I by adopting the k-means algorithm.
8. An apparatus for constructing a word vector, comprising:
the preprocessing module is used for preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of the corpus formed by the m independent texts, wherein m is a positive integer;
the vocabulary set building module is used for traversing all sentences and vocabularies by taking each vocabulary of the corpus as a concept subject word, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module, configured to determine, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word;
a word vector construction module, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vector is obtained by a preset word embedding method.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
CN201911197725.3A 2019-11-29 2019-11-29 Word vector construction method and device Active CN112883715B (en)


Publications (2)

Publication Number Publication Date
CN112883715A (en) 2021-06-01
CN112883715B (en) 2023-11-07

Family

ID=76038307


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011227688A (en) * 2010-04-20 2011-11-10 Univ Of Tokyo Method and device for extracting relation between two entities in text corpus
JP2014085947A (en) * 2012-10-25 2014-05-12 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for answering question
CN109635077A (en) * 2018-12-18 2019-04-16 武汉斗鱼网络科技有限公司 Calculation method, device, electronic equipment and the storage medium of text similarity
CN110134777A (en) * 2019-05-29 2019-08-16 三角兽(北京)科技有限公司 Problem De-weight method, device, electronic equipment and computer readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘嵩; 张先飞; 李弼程; 孙显著: "Automatic topic detection method based on concept similarity", Journal of Information Engineering University, no. 03 *
刘金岭; 谈芸; 李健普; 袁娜: "Automatic extraction method of Chinese text topics based on multiple factors", Computer Technology and Development, no. 07 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant