CN111191024A - Method for calculating sentence semantic vector - Google Patents

Method for calculating sentence semantic vector Download PDF

Info

Publication number
CN111191024A
CN111191024A (application CN201811348612.4A; granted as CN111191024B)
Authority
CN
China
Prior art keywords
word
sentence
vector
calculated
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811348612.4A
Other languages
Chinese (zh)
Other versions
CN111191024B (en)
Inventor
罗立刚
刘辉
张正宽
张天泽
常涛
王玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zero Krypton Technology Tianjin Co Ltd
Original Assignee
Zero Krypton Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zero Krypton Technology Tianjin Co Ltd filed Critical Zero Krypton Technology Tianjin Co Ltd
Priority to CN201811348612.4A priority Critical patent/CN111191024B/en
Publication of CN111191024A publication Critical patent/CN111191024A/en
Application granted granted Critical
Publication of CN111191024B publication Critical patent/CN111191024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for calculating sentence semantic vectors, which comprises the following steps: A. performing word segmentation on each sentence sample in a corpus to obtain a word set, and training with a word-vector generation tool to obtain a word vector for each word, forming a word-vector set; B. performing a word-vector mean calculation on the sentence to be calculated using the word-vector set, to obtain the sentence vector of that sentence; C. finding, in the word set, the several words with the highest similarity to each constituent word of the sentence to be calculated, forming one candidate set per constituent word; D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated. The method calculates sentence semantic vectors by incorporating the word vectors of the neighbors of each constituent word in the sentence; it makes full use of the semantic information of all words and yields a more reasonable expression.

Description

Method for calculating sentence semantic vector
Technical Field
The invention relates to the technical field of text information processing, in particular to a method for calculating semantic vectors of sentences.
Background
The internet has gradually become the carrier of record for people's lives and work. While this makes information easier to obtain, it also generates a large amount of text data, and extracting important information from that complex data in a timely, effective way requires artificial intelligence to process natural language effectively. In the field of Natural Language Processing (NLP), the calculation of sentence semantics is a basic form of semantic expression, and a reasonable way of expressing sentence semantics provides favorable support for downstream applications. Traditional sentence semantic expression is usually a recalculation over word vectors: the most common approach is to average the word vectors of the sentence, or to use an intermediate result produced during neural network training as the sentence vector. However, because the constituent words of a sentence are sometimes used inaccurately, or the mode of expression is incorrect, these existing approaches can yield inaccurate results and fail to provide favorable support for downstream applications.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method for calculating a sentence semantic vector that incorporates the word vectors of the neighbors of each constituent word in the sentence. The method makes full use of the semantic information of all words, is simple to implement, and produces a more reasonable expression; it can solve the problem of incorrect sentence semantic expression caused by wrong constituent words or a wrong mode of expression, and provides favorable support for downstream applications.
The technical scheme adopted by the invention is a method for calculating sentence semantic vectors, comprising the following steps:
A. performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
B. performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
C. finding out a plurality of words with the highest similarity with each constituent word of the sentence to be calculated in the word set, and respectively forming a candidate set;
D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated.
In this way, word segmentation is performed on each sentence sample in an existing corpus to obtain a word set, a word-vector generation tool produces a word vector for each word, and averaging these vectors over a sentence yields its sentence vector. However, because a sentence's word composition sometimes contains errors, or its mode of expression is not fixed, the several words nearest to each constituent word are selected to form candidate sets; the distance between each candidate word and the sentence vector is then used as a weight and multiplied by that word's vector. Introducing the vectors of neighboring words in this way yields a more reasonable expression of the sentence semantics with a low error rate, solves the problem of incorrect semantic expression caused by wrong constituent words or modes of expression, and provides beneficial support for downstream applications.
Wherein the step B comprises:
extracting word vectors of all the constituent words of the sentence to be calculated from the word vector set;
performing a mean calculation on the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
In this way, the sentence vector of the sentence to be calculated is obtained by averaging the word vectors of its constituent words. This sentence vector is only a base vector, not the final semantic vector to be calculated.
Wherein the step C comprises:
calculating the similarity between each constituent word of the sentence to be calculated and the word set through a proximity algorithm;
selecting the several words with the highest similarity to form a candidate set, the number of candidate sets being the same as the number of constituent words of the sentence to be calculated;
and selecting each word vector in the candidate set from the word vector set to form a word vector set of the candidate set.
In this way, the similarity between each constituent word of the sentence to be calculated and the other words in the word set is compared, and the several most similar words are selected to form a candidate set; a sentence with a given number of constituent words therefore generates the same number of candidate sets.
Wherein the step D comprises:
calculating each distance between each word in the candidate set and the sentence vector;
multiplying each distance by the word vector of the corresponding word in the candidate set, and performing a mean calculation on the products to obtain the semantic vector of the sentence to be calculated.
Therefore, because each word in a candidate set differs in its similarity to the corresponding constituent word, the distance between each candidate word and the sentence vector is used as a weight and multiplied by that word's vector, and the products are averaged to obtain the final semantic vector of the sentence to be calculated, providing better support for downstream applications.
Drawings
FIG. 1 is a flow chart of a method of calculating a sentence semantic vector according to the present invention.
Detailed Description
The following describes in detail the method for calculating a sentence semantic vector according to the present invention with reference to FIG. 1; it specifically includes the following steps:
s100: performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
setting a corpus sample set S1, wherein the set S1 comprises a plurality of sentence samples Si, i is a natural number greater than or equal to 1, performing word segmentation on the sentence samples Si to obtain word segmentation results Wij, and the word segmentation results Wij obtain a word set S2.
Unsupervised word-vector training is performed on the word set S2 to obtain the word-vector set Wij_vec. Word vectors have good semantic characteristics and are a common way to represent word features: the value of each dimension represents a feature with a certain semantic or grammatical interpretation, so each dimension may be called a word feature. In this embodiment, the word vectors in the word set may be trained with the Word2vec model, a word-vector training tool released by Google in 2013 that, given a corpus, quickly and effectively expresses each word as a vector through an optimized training model.
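Purely as an illustrative sketch (not part of the claimed method, and not a real training run), the word-vector set Wij_vec can be pictured as a mapping from each word in S2 to a fixed-length vector. A real implementation would train Word2vec over the corpus; the deterministic pseudo-vectors below are only a stand-in so the later steps can be demonstrated end to end:

```python
import hashlib

def toy_word_vectors(word_set, dim=4):
    """Stand-in for a trained word-vector set Wij_vec.

    A real system would train Word2vec on the corpus; here each word is
    mapped to a deterministic pseudo-vector in [-1, 1] so that the
    downstream steps of the method can be illustrated.
    """
    vectors = {}
    for word in word_set:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        # Scale the first `dim` bytes (0..255) into [-1, 1].
        vectors[word] = [b / 127.5 - 1.0 for b in digest[:dim]]
    return vectors

S2 = ["I", "like", "eat", "apple", "orange", "banana"]
Wij_vec = toy_word_vectors(S2)
```

With a real toolkit such as gensim, Wij_vec would instead be the trained `wv` lookup of a Word2Vec model; the word-to-vector dictionary shape used here is the same.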
S200: performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
First, the sentence to be calculated is segmented into words, the word vectors of its constituent words are extracted from the word-vector set Wij_vec, and the extracted vectors are averaged to obtain the sentence vector of the sentence to be calculated.
Specifically, if the sentence vector corresponding to the sentence to be calculated is W1, then W1 = (1/k) * sum(Wij_vec), where j runs from 1 to k and k is the number of constituent words in the sentence to be calculated.
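A minimal sketch of this mean calculation, with hypothetical two-dimensional vectors standing in for real embeddings:

```python
def sentence_vector(words, word_vectors):
    """Compute W1 = (1/k) * sum of the constituent words' vectors."""
    vecs = [word_vectors[w] for w in words]
    k = len(vecs)
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / k for d in range(dim)]

# Hypothetical 2-d word vectors for the constituent words of one sentence.
Wij_vec = {"I": [1.0, 0.0], "like": [0.0, 1.0], "apple": [1.0, 1.0]}
W1 = sentence_vector(["I", "like", "apple"], Wij_vec)  # [2/3, 2/3]
```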
S300: finding, in the word set, the several words with the highest similarity to each constituent word of the sentence to be calculated, forming the candidate sets and training a word vector for each word in them;
the word similarity is a quantitative measure of the number quantization of complex relationships among words and is a quantitative measure of the semantic similarity degree among words, and the specific calculation process is as follows:
calculating the similarity between each constituent word of the sentence to be calculated and the words in the word set S2 through a proximity algorithm;
In this embodiment of the invention, the proximity-based similarity can be calculated with the standard cosine similarity formula or the Euclidean distance formula. Cosine similarity uses the cosine of the angle between two vectors in a vector space to measure the similarity between two texts; compared with distance metrics, it pays more attention to the difference in direction between the two vectors. In the usual case, once the two texts have been represented as vectors by an embedding method, cosine similarity can be used to compute their similarity. Euclidean distance, also known as the Euclidean metric, is the most common distance measure; it is the true distance between two points in a multi-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin). In two and three dimensions, the Euclidean distance is the actual distance between the two points.
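The two formulas can be sketched directly (a generic illustration, not code from the patent):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```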
For example, in the sentences "I like eating apples" and "I like the Apple mobile phone", the word "apple" belongs to the fruit domain in the first sentence and the mobile-phone domain in the second. Calculating the similarity of neighboring words resolves this: if "apple" has a higher word-vector similarity to fruit words such as "orange" and "banana", it belongs to the fruit domain; if it has a higher word-vector similarity to mobile-phone words such as "Samsung" and "Xiaomi", it belongs to the mobile-phone domain.
According to the similarity ranking, the several words with the highest similarity are selected to form the candidate sets TSet, the number of which equals the number of constituent words of the sentence to be calculated; that is, for each constituent word of the sentence, the several most similar words are selected to form one candidate set TSet, and the word vectors in the candidate sets TSet are trained with the Word2vec model.
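An illustrative sketch of forming one candidate set TSet for a single constituent word, again with hypothetical toy vectors (a real system would rank over the full word set S2 using trained embeddings):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def candidate_set(word, word_vectors, top_n=2):
    """Return the top_n words of the word set most similar to `word`."""
    target = word_vectors[word]
    others = [w for w in word_vectors if w != word]
    others.sort(key=lambda w: cosine_similarity(target, word_vectors[w]),
                reverse=True)
    return others[:top_n]

# Hypothetical vectors: "orange" and "banana" lie near "apple"; "phone" does not.
Wij_vec = {
    "apple":  [1.0, 0.1],
    "orange": [0.9, 0.2],
    "banana": [0.8, 0.15],
    "phone":  [0.0, 1.0],
}
TSet = candidate_set("apple", Wij_vec)
```

One such set is built per constituent word, so a k-word sentence produces k candidate sets.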
S400: calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated;
In this step, based on the similarities between the words in the candidate sets TSet and the constituent words of the sentence computed in step S300, the semantic vector of the sentence to be calculated is obtained by computing the distance between each word in TSet and the sentence vector W1, multiplying each candidate word's vector by its distance, used as a weight, and performing a mean calculation on the products.
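Read literally, step S400 weights each candidate word's vector by its distance to W1 and averages the products. A minimal sketch under that literal reading, with toy numbers (whether smaller distances should instead receive larger weights is a design choice the patent text leaves open):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semantic_vector(candidates, word_vectors, W1):
    """Average of each candidate word's vector scaled by its distance to W1."""
    weighted = []
    for w in candidates:
        d = euclidean_distance(word_vectors[w], W1)
        weighted.append([d * x for x in word_vectors[w]])
    n = len(weighted)
    return [sum(v[i] for v in weighted) / n for i in range(len(W1))]

W1 = [0.5, 0.5]  # hypothetical sentence vector from step S200
Wij_vec = {"orange": [1.0, 0.5], "banana": [0.5, 1.0]}
vec = semantic_vector(["orange", "banana"], Wij_vec, W1)  # [0.375, 0.375]
```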
For example, when calculating the semantic vector of the sentence "I like eating apples", the vectors of the neighboring words "orange" and "banana" of the word "apple" are introduced into the semantics: the distance between each neighboring word and the sentence vector is used as a proportion and multiplied by that word's vector, making the expression of the sentence semantic vector more reasonable and avoiding errors.
In summary, the invention calculates the sentence semantic vector by incorporating the word vectors of the neighbors of each constituent word in the sentence. It makes full use of the semantic information of all words, is simple to implement, and produces a more reasonable expression; it can solve the problem of incorrect sentence semantic expression caused by wrong constituent words or modes of expression, and provides favorable support for downstream applications.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A method of computing a sentence semantic vector, comprising the steps of:
A. performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
B. performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
C. finding out a plurality of words with the highest similarity with each constituent word of the sentence to be calculated in the word set, and respectively forming a candidate set;
D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated.
2. The method of claim 1, wherein step B comprises:
extracting word vectors of all the constituent words of the sentence to be calculated from the word vector set;
and carrying out mean value calculation on the extracted word vectors to obtain sentence vectors of the sentences to be calculated.
3. The method of claim 2, wherein step C comprises:
calculating the similarity between each constituent word of the sentence to be calculated and the word set through a proximity algorithm;
selecting the several words with the highest similarity to form a candidate set, the number of candidate sets being the same as the number of constituent words of the sentence to be calculated;
and selecting each word vector in the candidate set from the word vector set to form a word vector set of the candidate set.
4. The method of claim 3, wherein step D comprises:
calculating each distance between each word in the candidate set and the sentence vector;
and correspondingly multiplying each distance by the word vector of each word in the candidate set, and performing mean value calculation on the multiplied results to obtain the semantic vector of the sentence to be calculated.
CN201811348612.4A 2018-11-13 2018-11-13 Method for calculating sentence semantic vector Active CN111191024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Publications (2)

Publication Number Publication Date
CN111191024A true CN111191024A (en) 2020-05-22
CN111191024B CN111191024B (en) 2023-06-23

Family

ID=70705086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811348612.4A Active CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Country Status (1)

Country Link
CN (1) CN111191024B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008186A (en) * 2014-06-11 2014-08-27 北京京东尚科信息技术有限公司 Method and device for determining keywords in target text
US20160350283A1 (en) * 2015-06-01 2016-12-01 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sun Zhiyuan; Wang Wei; Ma Di; Mao Wei: "Text similarity calculation method in the field of mobile marketing" *
Huang Jiangping; Ji Donghong: "Research on paraphrase recognition based on sentence semantic distance" *

Also Published As

Publication number Publication date
CN111191024B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108073568B (en) Keyword extraction method and device
CN107861939B (en) Domain entity disambiguation method fusing word vector and topic model
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
WO2020244073A1 (en) Speech-based user classification method and device, computer apparatus, and storage medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN108269125B (en) Comment information quality evaluation method and system and comment information processing method and system
Chen et al. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
JP4904496B2 (en) Document similarity derivation device and answer support system using the same
US20200278976A1 (en) Method and device for evaluating comment quality, and computer readable storage medium
CN110414004A (en) A kind of method and system that core information extracts
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
CN110705247A (en) Based on x2-C text similarity calculation method
Yoshino et al. Dialogue state tracking using long short term memory neural networks
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN113449084A (en) Relationship extraction method based on graph convolution
CN110347833B (en) Classification method for multi-round conversations
CN113326374B (en) Short text emotion classification method and system based on feature enhancement
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
CN111460117A (en) Dialog robot intention corpus generation method, device, medium and electronic equipment
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN111563361A (en) Text label extraction method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant