CN112883715A - Word vector construction method and device - Google Patents

Word vector construction method and device

Info

Publication number
CN112883715A
CN112883715A (application number CN201911197725.3A)
Authority
CN
China
Prior art keywords
vocabulary
word
concept
importance
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911197725.3A
Other languages
Chinese (zh)
Other versions
CN112883715B (en)
Inventor
刘垚
邹更
任钰欣
黄梓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Yujianwan Technology Co ltd
Original Assignee
Wuhan Yujianwan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Yujianwan Technology Co ltd filed Critical Wuhan Yujianwan Technology Co ltd
Priority to CN201911197725.3A priority Critical patent/CN112883715B/en
Publication of CN112883715A publication Critical patent/CN112883715A/en
Application granted granted Critical
Publication of CN112883715B publication Critical patent/CN112883715B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 - Thesaurus
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for constructing a word vector. The method comprises: preprocessing each text in a text set of m independent texts to obtain all sentences and vocabulary of the corpus formed by the m texts; taking each word of the corpus in turn as a concept topic word, traversing all sentences and words, and placing every word that appears in the same sentence as the concept topic word into the vocabulary set corresponding to that concept topic word; screening the vocabulary elements in each vocabulary set; determining the importance of each vocabulary element from its inverse text frequency index in the filtered vocabulary set X_I and its co-occurrence with the concept topic word; and constructing the word vector of the filtered vocabulary set X_I from the importance of each vocabulary element and the initial word vectors. A word vector constructed by the method can fully express the relation between a word and the global text.

Description

Word vector construction method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for constructing word vectors.
Background
Currently, in the field of natural language processing, word vectors are a common feature representation method for linguistic symbols. Common word embedding methods for constructing word vectors mainly include word2vec, GloVe and the like.
In the process of implementing the invention, the inventors found that the existing methods have at least the following technical problems:
the word2vec method depends only on local information, namely the relations between neighboring words, and does not use the overall information of the text; GloVe incorporates global vocabulary statistics alongside local information. Neither method, however, comprehensively expresses the relation between a word and the global text, so the constructed word vectors are not informative enough.
Therefore, the methods in the prior art suffer from the technical problem that the word vector information is not comprehensive enough.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for constructing a word vector, so as to solve or at least partially solve the technical problem that the expression of word vector information constructed by the method in the prior art is not comprehensive.
In order to solve the above technical problem, a first aspect of the present invention provides a method for constructing a word vector, including:
preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer;
taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
screening vocabulary elements in a vocabulary set;
determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
In one embodiment, screening vocabulary elements in a vocabulary set includes:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
In one embodiment, determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word includes:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of a vocabulary element with the concept topic word by the ratio of z_j, the number of texts in which the vocabulary element x_j and the concept topic word x_i co-occur, to the sum of the numbers of texts in which all vocabulary elements in X_I co-occur with x_i;
and determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
In one embodiment, calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I includes:
calculating the inverse text frequency index of each vocabulary element according to the total number of vocabulary sets and the number of vocabulary sets containing the vocabulary element x_j, by the formula:

IDF_j = log10(count_X / count_xj)

wherein count_X denotes the total number of vocabulary sets, count_xj denotes the number of vocabulary sets containing the vocabulary element x_j, and IDF_j denotes the inverse text frequency index of x_j.
In one embodiment, determining the importance of each vocabulary element under the same concept topic word based on the inverse text frequency index and the calculated ratio of each vocabulary element comprises:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
In one embodiment, constructing the word vector of the filtered vocabulary set X_I based on the importance of each vocabulary element and the initial word vector comprises:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
In one embodiment, after constructing the word vectors for the vocabulary set, the method further comprises:
according to each vocabulary set XIAnd (4) clustering the vectors of all vocabulary sets by adopting a k-means algorithm.
Based on the same inventive concept, a second aspect of the present invention provides a word vector constructing apparatus, including:
the preprocessing module is used for preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of the corpus formed by the m independent texts, wherein m is a positive integer;
the vocabulary set building module is used for traversing all sentences and vocabularies by taking each vocabulary of the corpus as a concept subject word, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module, configured to determine the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
a word vector construction module, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device having a computer program stored thereon, which when executed performs the method of the first aspect.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a word vector construction method, which comprises the steps of firstly preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus consisting of the m independent texts; then, each vocabulary of the corpus is used as a concept subject word, all sentences and vocabularies are traversed, and the vocabularies which commonly appear in the same sentence with the concept subject word are brought into a vocabulary set corresponding to the concept subject word; then, the words in the vocabulary set are alignedScreening the vocabulary elements; then according to the filtered vocabulary set XIDetermining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the co-occurrence condition of each vocabulary element and the concept topic word; then according to the importance of each vocabulary element and the initial word vector, constructing a screened vocabulary set XIThe word vector of (a).
In the method provided by the invention, on the basis of constructing word vectors from local text information with a preset word embedding method, the relation among the words of the global text is represented by the set of words that frequently co-occur with the target word across the global text data, and this relational information is used to correct the word vector of the word. The resulting new word vector makes full use of the relation between the target word and the global text vocabulary, so the constructed word vector carries more comprehensive and richer information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a word vector construction method according to the present invention;
FIG. 2 is a diagram illustrating clustering results obtained in an exemplary embodiment;
FIG. 3 is a block diagram of an apparatus for constructing word vectors according to an embodiment of the present invention;
FIG. 4 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 5 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a method and a device for constructing word vectors aiming at the technical problem that the expression of word vector information constructed by the method in the prior art is incomplete, so that the aim of improving the comprehensiveness of the word vector information is fulfilled.
In order to achieve the above object, the main concept of the present invention is as follows:
according to the co-occurrence rule in the text, a symbol set (vocabulary set) taking a single language symbol (concept subject word) as a center is constructed, meanwhile, a word embedding method is used for carrying out vector feature representation on each language symbol, and the vector of the vocabulary in the symbol set is used for correcting the center language symbol according to weight pairs. On the basis that word vectors are constructed by word2vec through local text information, the relation between the global text vocabularies is represented by combining a set formed by vocabularies which are frequently co-occurring with the target vocabularies in the global text data, and the word vectors of the vocabularies are corrected by utilizing the relation information to obtain new word vectors, so that the information of the new word vectors is richer, and the relation between the target vocabularies and the global text vocabularies can be more fully represented.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The present embodiment provides a method for constructing a word vector, please refer to fig. 1, the method includes:
step S1: preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer.
Specifically, the number of m may be determined according to actual conditions. The preprocessing comprises the steps of sentence segmentation, word segmentation, stop word removal and the like.
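A minimal sketch of the preprocessing in step S1 might look as follows; the regex-based sentence splitter, tokenizer, and stop-word list are assumptions for illustration (the patent does not fix an implementation, and Chinese text would need a word segmenter rather than `\w+` matching).

```python
import re

# Illustrative stop-word list; a real pipeline would use a full list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def preprocess(texts):
    """Split each text into sentences, tokenize, and drop stop words.
    Returns one list per text, each a list of token lists (sentences)."""
    corpus = []
    for text in texts:
        sentences = []
        for raw in re.split(r"[.!?。！？]+", text):
            tokens = [w for w in re.findall(r"\w+", raw.lower())
                      if w not in STOP_WORDS]
            if tokens:
                sentences.append(tokens)
        corpus.append(sentences)
    return corpus

docs = ["The stock market rose. Investors bought shares.",
        "Shares of the fund fell."]
corpus = preprocess(docs)
```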
Step S2: and taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements.
Specifically, all words of the corpus are obtained in step S1. In this step, for each word, the words appearing in the same sentence as that word are found, and a vocabulary set is constructed from them; a word appearing together with the concept topic word indicates that the two words are associated. The vocabulary set therefore contains two kinds of members: the concept topic word itself and its vocabulary elements.
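The set construction of step S2 can be sketched as below, assuming the corpus has already been preprocessed (step S1) into per-text lists of tokenized sentences; every word serves in turn as a concept topic word, and any word sharing a sentence with it joins that word's vocabulary set.

```python
from collections import defaultdict

def build_vocab_sets(corpus):
    """Map each concept topic word to the set of words that
    co-occur with it in at least one sentence."""
    vocab_sets = defaultdict(set)
    for sentences in corpus:
        for sentence in sentences:
            unique = set(sentence)
            for topic in unique:
                vocab_sets[topic] |= unique - {topic}
    return dict(vocab_sets)

corpus = [[["stock", "market", "rose"], ["investors", "bought", "shares"]],
          [["shares", "fund", "fell"]]]
vocab_sets = build_vocab_sets(corpus)
```

Note the topic word itself is excluded from its own set, matching the description that a vocabulary set consists of the concept topic word plus its vocabulary elements.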
Step S3: and screening the vocabulary elements in the vocabulary set.
Specifically, to improve the accuracy of the vocabulary set, this step further screens the vocabulary elements, for example according to the number of texts in which a vocabulary element and the concept topic word co-occur. Whether to retain a vocabulary element can be decided by judging whether it frequently appears together with the concept topic word in the same text, where "frequently" is defined by a set threshold.
In one embodiment, step S3 includes:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
Specifically, the first threshold may be set according to actual conditions, and may be, for example, 3, 5, 6, or the like. Through the screening of the vocabulary elements, the vocabulary which often appears in the same text with the concept subject word can be selected and used as the effective vocabulary, so that the accuracy of the vocabulary set is improved.
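The screening of step S3 can be sketched as follows; representing each text by its word set (`doc_words`) and the helper name `screen` are illustrative assumptions, and the returned counts z_j are kept for the later weight calculation.

```python
def screen(topic, elements, doc_words, threshold=3):
    """Keep element x_j only if it co-occurs with the concept topic
    word in at least `threshold` distinct texts; return {x_j: z_j}."""
    kept = {}
    for x_j in elements:
        z = sum(1 for words in doc_words
                if topic in words and x_j in words)
        if z >= threshold:
            kept[x_j] = z
    return kept

doc_words = [{"stock", "market"}, {"stock", "market", "fund"},
             {"stock", "market"}, {"stock", "fund"}]
kept = screen("stock", {"market", "fund"}, doc_words, threshold=3)
```

Here "market" co-occurs with "stock" in three texts and survives, while "fund" (two texts) is removed.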
Step S4: determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word.
Specifically, the basic idea of IDF is that the fewer vocabulary sets contain a given vocabulary element, the larger its IDF value, indicating that the element has better distinguishing power among the vocabulary sets. The co-occurrence of a vocabulary element with the concept topic word may be measured by the number of texts in which they co-occur, or by the ratio of that number to the total number of co-occurrences of all vocabulary elements with the concept topic word, and so on.
In one embodiment, step S4 includes:
step S4.1: calculating the filtered vocabulary set XIThe inverse text frequency index of each vocabulary element;
step S4.2: using the vocabulary element xjAnd concept topic word xiNumber of co-occurring texts zjAnd vocabulary set XIAll vocabulary elements and concept topic words x iniThe ratio of the sum of the number of the commonly occurring texts represents the co-occurrence condition of the vocabulary elements and the concept subject words;
step S4.3: and determining the importance of each vocabulary element under the same concept subject word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
Specifically, for all the vocabulary elements {x_j, x_k, …, x_n} in the vocabulary set X_I, the importance of each vocabulary element can be represented by the frequency P_j of x_j:

P_j = z_j / Σ_k z_k

that is, the number of texts z_j in which the vocabulary element x_j and the concept topic word x_i co-occur, divided by the sum, over all vocabulary elements in X_I, of the numbers of texts in which they co-occur with x_i.
in the embodiment, the importance of each vocabulary element under the same concept topic is further determined according to the inverse text frequency index based on the ratio of the number of times of the co-occurrence of the vocabulary elements and the concept topic and the number of times of the co-occurrence of all the vocabulary elements and the concept topic.
In one embodiment, step S4.1 comprises:
according to the total number of vocabulary sets and the included vocabulary element xjThe inverse text frequency index of each vocabulary element is calculated by the vocabulary set, and the calculation formula is as follows:
Figure BDA0002295086750000071
wherein, countXIndicates the total number of vocabulary sets, countxjThe representation contains a lexical element xjNumber of vocabulary sets, IDFjRepresenting a lexical element xjThe inverse text frequency index of (c).
Specifically, the vocabulary element x is paired withjThe inverse text frequency index of which is divided by the total number of vocabulary sets, by the inclusion of the vocabulary element xjThe number of the vocabulary sets obtained is then taken as the logarithm of the base 10,
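As an illustrative sketch, the IDF computation follows directly from the formula above; the `vocab_sets` dict mapping each concept topic word to its set of vocabulary elements is an assumed representation, not one fixed by the patent.

```python
import math

def inverse_text_frequency(element, vocab_sets):
    """IDF_j = log10(count_X / count_xj): total number of vocabulary
    sets over the number of sets containing the element."""
    count_X = len(vocab_sets)
    count_xj = sum(1 for members in vocab_sets.values() if element in members)
    return math.log10(count_X / count_xj)

vocab_sets = {"stock": {"market", "fund"},
              "bond": {"market"},
              "crop": {"soil"}}
idf_market = inverse_text_frequency("market", vocab_sets)  # log10(3/2)
```

An element appearing in every vocabulary set would get IDF = log10(1) = 0, i.e. no distinguishing power, matching the intuition stated above.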
in one embodiment, step S4.3 comprises:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
Specifically, for each vocabulary element x_j in the vocabulary set X_I, its weight coefficient T_xj is calculated as:

T_xj = IDF_j * P_j
step S5: constructing a screened vocabulary set X according to the importance of each vocabulary element and the initial word vectorIThe initial word vector is obtained by a preset word embedding mode.
In one embodiment, step S5 includes:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
Specifically, the word vector V_I of the vocabulary set X_I is obtained as:

V_I = Σ_j T_xj * v_j

that is, the weight coefficient T_xj of each vocabulary element is multiplied by the element's initial word vector v_j, and the products are summed.
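Putting steps S4 and S5 together, the weighting and summation can be sketched as follows, under the assumption that the co-occurrence counts z_j, the IDF values, and the initial embeddings are given as plain dicts keyed by vocabulary element (the function name `set_vector` is illustrative):

```python
def set_vector(z, idf, vectors):
    """V_I = sum_j T_xj * v_j, with T_xj = IDF_j * P_j
    and P_j = z_j / sum_k z_k."""
    total = sum(z.values())
    dim = len(next(iter(vectors.values())))
    v_I = [0.0] * dim
    for x_j, z_j in z.items():
        t = idf[x_j] * (z_j / total)   # weight coefficient T_xj
        for d in range(dim):
            v_I[d] += t * vectors[x_j][d]
    return v_I

# Tiny worked example: two elements with toy 2-d initial embeddings.
z = {"market": 3, "fund": 1}
idf = {"market": 0.5, "fund": 1.0}
vectors = {"market": [1.0, 0.0], "fund": [0.0, 2.0]}
v = set_vector(z, idf, vectors)  # [0.5*0.75*1.0, 1.0*0.25*2.0]
```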
In one embodiment, after constructing the word vectors for the vocabulary set, the method further comprises:
according to each vocabulary set XIAnd (4) clustering the vectors of all vocabulary sets by adopting a k-means algorithm.
Specifically, the word vectors of all the vocabulary sets are subjected to k-means clustering, the vocabulary sets are divided into k types, and the category of each vocabulary set is obtained.
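For the clustering step, a compact self-contained k-means sketch is shown below; a real pipeline would more likely call a library implementation such as scikit-learn's KMeans, and the toy 2-d points are illustrative only.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: random initial centers, then alternate
    assignment and mean updates for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    labels = [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(p, centers[c]))) for p in points]
    return centers, labels

# Two well-separated pairs of set vectors should land in two classes.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
centers, labels = kmeans(points, k=2)
```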
In order to more clearly illustrate the specific implementation of the method of the present invention, the following is presented by way of specific examples:
1. Example of the vocabulary set X_stock
The concept topic word is "stock", which appears in 51 articles of the corpus. The vocabulary elements in the following table are an excerpt of the effective vocabulary under the concept topic word "stock".
[Table of effective vocabulary elements not reproduced in this extraction.]
2. Example of vocabulary element weight calculation in the vocabulary set X_stock
[Weight calculation table not reproduced in this extraction.]
3. Computing the vector V_stock of the vocabulary set X_stock = {x_j, x_k, …}
[Worked calculation not reproduced in this extraction.]
4. Clustering all vocabulary sets
In this example, ten thousand vocabulary sets were clustered, resulting in 13 classes. Some of the results are as follows:
1 ["data organization", "operation", "image smoothing", "performance evaluation", "inner pixel", "coincidence", …]
2 ["transformation of industrial structure", "agricultural production", "transformation period", "regional economy", "market restriction", …]
3 ["documentary", "structural metaphor", "poetry", "writing order", "literary value", "aesthetic personality", …]
4 ["glassy carbon electrode", "centrifugal dehydration", "film stretching", "hexadecyl trimethyl ammonium bromide", "polyurethane", …]
5 ["radix scrophulariae", "throat", "main treatment", "sleep disorder", "cerebral hemorrhage", "knee osteoarthritis", …]
6 ["use seedling", "bud", "pot experiment", "survival rate", "water quality safety", "planting", "plant community", …]
Please refer to fig. 2, which visualizes the clustering result: the vectors of the vocabulary sets are reduced from high dimension to two dimensions and plotted so that the clustering effect can be observed. In fig. 2, the horizontal and vertical axes are the two reduced dimensions, each point represents a vocabulary set, the color of a point indicates its cluster, and a point's coordinates are its two-dimensional values after dimension reduction.
Example two
Based on the same inventive concept, the present embodiment provides a word vector constructing apparatus, please refer to fig. 3, including:
a preprocessing module 201, configured to preprocess each text included in a text set including m independent texts, to obtain all sentences and vocabularies of a corpus configured by the m independent texts, where m is a positive integer;
the vocabulary set constructing module 202 is used for respectively taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module 203 is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module 204, configured to determine the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word;
a word vector construction module 205, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vectors are obtained by a preset word embedding method.
In one embodiment, the vocabulary element filtering module is specifically configured to:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z ≤ m;
and judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as effective vocabulary, otherwise removing the vocabulary element from the vocabulary set.
In one embodiment, the vocabulary element importance determination module is specifically configured to:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of a vocabulary element with the concept topic word by the ratio of z_j, the number of texts in which the vocabulary element x_j and the concept topic word x_i co-occur, to the sum of the numbers of texts in which all vocabulary elements in X_I co-occur with x_i;
and determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
In one embodiment, the vocabulary element importance determination module is further configured to:
according to the total number of vocabulary sets and the included vocabulary element xjThe inverse text frequency index of each vocabulary element is calculated by the vocabulary set, and the calculation formula is as follows:
Figure BDA0002295086750000111
wherein, countXIndicates the total number of vocabulary sets, countxjThe representation contains a lexical element xjNumber of vocabulary sets, IDFjRepresenting a lexical element xjThe inverse text frequency index of (c).
In one embodiment, the vocabulary element importance determination module is further configured to:
and multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient is used for representing the importance of each vocabulary element under the same concept subject word.
In one embodiment, the word vector construction module is further configured to:
and multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to the vocabulary element, and summing to obtain the word vector of the vocabulary set.
In one embodiment, the apparatus further comprises a clustering module configured to:
according to each vocabulary set XIThe vectors of all vocabulary sets are processed by adopting a k-means algorithmAnd (6) clustering.
Since the apparatus described in the second embodiment of the present invention is an apparatus used for implementing the method for constructing a word vector in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the apparatus based on the method described in the first embodiment of the present invention, and thus the details are not described herein again. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
Embodiment Three
Referring to fig. 4, based on the same inventive concept, the present application further provides a computer-readable storage medium 300 on which a computer program 311 is stored; when executed, the program implements the method of the first embodiment.
Since the computer-readable storage medium described in the third embodiment of the present invention is used to implement the word vector construction method of the first embodiment, a person skilled in the art can understand its specific structure and variations based on the method described in the first embodiment, so the details are not repeated here. Any computer-readable storage medium used in the method of the first embodiment of the present invention falls within the protection scope of the present invention.
Embodiment Four
Based on the same inventive concept, the present application further provides a computer device; referring to fig. 5, the device comprises a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; when the processor 402 executes the program, the method of the first embodiment is implemented.
Since the computer device introduced in the fourth embodiment of the present invention is used to implement the word vector construction method of the first embodiment, a person skilled in the art can understand its specific structure and variations based on the method introduced in the first embodiment, so the description is not repeated here. All computer devices used in the method of the first embodiment of the present invention fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method for constructing a word vector, comprising:
preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of a corpus formed by the m independent texts, wherein m is a positive integer;
taking each vocabulary of the corpus as a concept subject word, traversing all sentences and vocabularies, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
filtering vocabulary elements in the vocabulary set;
determining, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word;
constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vector is obtained by a preset word embedding method.
2. The method of claim 1, wherein filtering vocabulary elements in a vocabulary set comprises:
counting, for each vocabulary element x_j in the vocabulary set, the number z of texts in which x_j and the concept topic word x_i appear together, wherein z is less than or equal to m;
judging whether the number of texts z is greater than or equal to a first threshold; if so, retaining the vocabulary element in the vocabulary set as an effective vocabulary of the set; otherwise, removing the vocabulary element from the vocabulary set.
3. The method of claim 1, wherein determining, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word comprises:
calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I;
representing the co-occurrence of each vocabulary element with the concept topic word by the ratio of the number z_j of texts in which the vocabulary element x_j and the concept topic word x_i appear together to the sum of the numbers of texts in which all vocabulary elements in the vocabulary set X_I appear together with the concept topic word x_i;
and determining the importance of each vocabulary element under the same concept subject word according to the inverse text frequency index of each vocabulary element and the calculated ratio.
4. The method of claim 3, wherein calculating the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I comprises:
calculating the inverse text frequency index of each vocabulary element according to the total number of vocabulary sets and the number of vocabulary sets containing the vocabulary element x_j, with the calculation formula:
IDF_j = log(count_X / count_{x_j})
wherein count_X indicates the total number of vocabulary sets, count_{x_j} indicates the number of vocabulary sets containing the vocabulary element x_j, and IDF_j represents the inverse text frequency index of the vocabulary element x_j.
5. The method of claim 3, wherein determining the importance of each vocabulary element under the same concept topic word according to the inverse text frequency index of each vocabulary element and the calculated ratio comprises:
multiplying the inverse text frequency index of each vocabulary element by the calculated ratio to obtain a weight coefficient, wherein the weight coefficient represents the importance of each vocabulary element under the same concept topic word.
6. The method of claim 5, wherein constructing the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector comprises:
multiplying the weight coefficient of each vocabulary element in the vocabulary set by the initial word vector corresponding to that vocabulary element, and summing the products to obtain the word vector of the vocabulary set.
7. The method of claim 1, wherein after constructing word vectors for a vocabulary set, the method further comprises:
clustering the word vectors of all vocabulary sets X_I by adopting the k-means algorithm.
8. An apparatus for constructing a word vector, comprising:
the preprocessing module is used for preprocessing each text contained in a text set containing m independent texts to obtain all sentences and vocabularies of the corpus formed by the m independent texts, wherein m is a positive integer;
the vocabulary set building module is used for traversing all sentences and vocabularies by taking each vocabulary of the corpus as a concept subject word, and bringing the vocabulary which commonly appears in the same sentence with the concept subject word into a vocabulary set corresponding to the concept subject word, wherein the vocabulary set comprises the concept subject word and vocabulary elements;
the vocabulary element screening module is used for screening vocabulary elements in the vocabulary set;
a vocabulary element importance determination module, configured to determine, according to the inverse text frequency index of each vocabulary element in the filtered vocabulary set X_I and the co-occurrence of each vocabulary element with the concept topic word, the importance of each vocabulary element under the same concept topic word;
a word vector construction module, configured to construct the word vector of the filtered vocabulary set X_I according to the importance of each vocabulary element and the initial word vector, wherein the initial word vector is obtained by a preset word embedding method.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
CN201911197725.3A 2019-11-29 2019-11-29 Word vector construction method and device Active CN112883715B (en)


Publications (2)

Publication Number Publication Date
CN112883715A (en) 2021-06-01
CN112883715B (en) 2023-11-07

Family

ID=76038307


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011227688A (en) * 2010-04-20 2011-11-10 Univ Of Tokyo Method and device for extracting relation between two entities in text corpus
JP2014085947A (en) * 2012-10-25 2014-05-12 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for answering question
CN109635077A (en) * 2018-12-18 2019-04-16 武汉斗鱼网络科技有限公司 Calculation method, device, electronic equipment and the storage medium of text similarity
CN110134777A (en) * 2019-05-29 2019-08-16 三角兽(北京)科技有限公司 Problem De-weight method, device, electronic equipment and computer readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘嵩; 张先飞; 李弼程; 孙显著: "Automatic topic detection method based on concept similarity", Journal of Information Engineering University, no. 03 *
刘金岭; 谈芸; 李健普; 袁娜: "Automatic extraction method of Chinese text topics based on multiple factors", Computer Technology and Development, no. 07 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant