CN111191024B - Method for calculating sentence semantic vector - Google Patents

Method for calculating sentence semantic vector

Info

Publication number
CN111191024B
CN111191024B
Authority
CN
China
Prior art keywords
word
sentence
vector
calculated
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811348612.4A
Other languages
Chinese (zh)
Other versions
CN111191024A (en)
Inventor
罗立刚
刘辉
张正宽
张天泽
常涛
王玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zero Krypton Technology Tianjin Co ltd
Original Assignee
Zero Krypton Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zero Krypton Technology Tianjin Co ltd filed Critical Zero Krypton Technology Tianjin Co ltd
Priority to CN201811348612.4A priority Critical patent/CN111191024B/en
Publication of CN111191024A publication Critical patent/CN111191024A/en
Application granted granted Critical
Publication of CN111191024B publication Critical patent/CN111191024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for calculating sentence semantic vectors, comprising the following steps: A. segment each sentence sample in the corpus into words to obtain a word set, and train a word-vector generation tool to obtain a word vector for each word, forming a word vector set; B. average the word vectors of the sentence to be calculated, using the word vector set, to obtain its sentence vector; C. from the word set, find the several words most similar to each constituent word of the sentence to be calculated, forming one candidate set per constituent word; D. calculate the distance between each word in the candidate sets and the sentence vector, and use each distance as a weight to multiply the corresponding candidate word vector, obtaining the semantic vector of the sentence to be calculated. By combining the word vectors of the neighbor words of each constituent word in the sentence, the method makes full use of the semantic information of all words and yields a more reasonable representation.

Description

Method for calculating sentence semantic vector
Technical Field
The invention relates to the technical field of text information processing, and in particular to a method for calculating sentence semantic vectors.
Background
The internet has gradually become the information carrier through which people record their lives and work. While it makes information easier to obtain, its use also generates a large amount of text data, and extracting important information from this complex text in a timely and effective way requires artificial intelligence to process natural language. In the field of Natural Language Processing (NLP), sentence semantic computation is a basic form of semantic representation, and a reasonable sentence-level semantic representation provides useful support for downstream applications. Traditional sentence semantics are usually derived by recomputing word vectors: the most common approach is to average the word vectors in a sentence, or to take an intermediate result of neural network training as the sentence vector. However, because the words that make up a sentence are sometimes inaccurate or the expression is incorrect, these existing representations can produce inaccurate results and fail to support downstream applications.
Disclosure of Invention
Therefore, the main purpose of the present invention is to provide a method for calculating sentence semantic vectors that combines the word vectors of the neighbor words of each constituent word in a sentence. The method makes full use of the semantic information of all words, is simple to implement, and yields a more reasonable representation. It can mitigate incorrect sentence semantics caused by wrong constituent words or expressions, thereby providing useful support for downstream applications.
The technical solution adopted by the invention is a method for calculating sentence semantic vectors, comprising the following steps:
A. segment each sentence sample in the corpus into words to obtain a word set, and train a word-vector generation tool to obtain a word vector for each word, forming a word vector set;
B. average the word vectors of the sentence to be calculated, using the word vector set, to obtain the sentence vector of the sentence to be calculated;
C. from the word set, find the several words most similar to each constituent word of the sentence to be calculated, forming one candidate set per constituent word;
D. calculate the distance between each word in the candidate sets and the sentence vector, use each distance as a weight to multiply the corresponding candidate word vector, and so obtain the semantic vector of the sentence to be calculated.
With this method, each sentence sample in an existing corpus is segmented to obtain a word set, a word-vector generation tool produces a vector for each word, and averaging the vectors of a sentence's words yields its sentence vector. However, because a sentence's constituent words sometimes contain errors or its phrasing is not fixed, the method additionally selects, for each constituent word, the several nearest words to form a candidate set. The distance between each candidate word and the sentence vector then serves as a weight that multiplies the candidate's word vector, introducing the vectors of neighboring words into the sentence. The result is a more reasonable sentence-semantic representation with a lower error rate, which mitigates incorrect semantics caused by wrong constituent words or expressions and provides useful support for downstream applications.
Wherein step B comprises the following steps:
extracting the word vector of each constituent word of the sentence to be calculated from the word vector set;
and averaging the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
As described above, the sentence vector is obtained by retrieving and averaging the word vectors of the sentence's constituent words; it is only a basic vector, not the final computed semantic vector.
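As a hedged sketch, the averaging in step B can be written as follows. The three-dimensional vectors and the function name `sentence_vector` are invented for illustration; real vectors would come from a trained word-vector model.

```python
# Step B sketch: the sentence vector is the mean, dimension by dimension,
# of the word vectors of the sentence's constituent words.

def sentence_vector(words, word_vecs):
    """Average the word vectors of `words` (words missing a vector are skipped)."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

# Toy 3-d vectors, illustrative only (not trained embeddings).
word_vecs = {
    "i":     [0.1, 0.0, 0.2],
    "like":  [0.3, 0.4, 0.1],
    "apple": [0.5, 0.2, 0.9],
}
print(sentence_vector(["i", "like", "apple"], word_vecs))
```

Each output dimension is the mean of that dimension across the three word vectors.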
Wherein step C comprises the following steps:
calculating, through a proximity algorithm, the similarity between each word in the word set and each constituent word of the sentence to be calculated;
selecting the several words with the highest similarity to form candidate sets, the number of candidate sets being equal to the number of constituent words of the sentence to be calculated;
and selecting the word vector of each candidate word from the word vector set to form the candidate sets' word vector set.
As described above, by comparing each constituent word of the sentence to be calculated against the other words in the word set and selecting the several most similar words to form a candidate set, a sentence with multiple constituent words yields a corresponding number of candidate sets.
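A minimal sketch of the candidate-set construction in step C, assuming cosine similarity as the proximity measure. The toy vectors and the helper names `cosine` and `candidate_set` are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def candidate_set(word, word_vecs, k=2):
    """Step C sketch: the k vocabulary words most similar to `word`."""
    scored = [(other, cosine(word_vecs[word], word_vecs[other]))
              for other in word_vecs if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]

# Toy vectors, illustrative only: fruit words point in a similar direction.
word_vecs = {
    "apple":   [0.50, 0.20, 0.90],
    "orange":  [0.50, 0.25, 0.85],
    "banana":  [0.45, 0.20, 0.95],
    "samsung": [0.90, 0.80, 0.10],
}
print(candidate_set("apple", word_vecs))  # fruit words outrank "samsung"
```

One such candidate set is built per constituent word of the sentence.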
Wherein step D comprises the following steps:
calculating the distance between each word in the candidate sets and the sentence vector;
multiplying each distance by the word vector of the corresponding candidate word, and averaging the products to obtain the semantic vector of the sentence to be calculated.
As described above, because each candidate word's similarity to the sentence differs, so does its distance to the sentence vector. Taking each distance as a weight, the candidate word vectors are multiplied and the products averaged, which yields the final semantic vector of the sentence to be calculated and provides better support for downstream applications.
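The weighting-and-averaging of step D can be sketched as follows, assuming cosine similarity to the sentence vector as the weight. The function names and the two-dimensional toy inputs are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_vector(candidates, word_vecs, sent_vec):
    """Step D sketch: weight each candidate word vector by its similarity
    to the sentence vector, then average the weighted vectors."""
    weighted = [[cosine(word_vecs[w], sent_vec) * x for x in word_vecs[w]]
                for w in candidates]
    dim = len(sent_vec)
    return [sum(v[d] for v in weighted) / len(weighted) for d in range(dim)]

# Toy example: "a" is aligned with the sentence vector, "b" is orthogonal.
sent_vec = [1.0, 0.0]
word_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(semantic_vector(["a", "b"], word_vecs, sent_vec))
```

Here the orthogonal candidate "b" receives weight 0, so it contributes nothing to the result, while the aligned candidate "a" contributes fully.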
Drawings
FIG. 1 is a flow chart of a method of computing sentence semantic vectors according to the present invention.
Detailed Description
The method for calculating sentence semantic vectors according to the present invention is described in detail below with reference to FIG. 1; it comprises the following steps:
s100: word segmentation is carried out on each sentence sample in the corpus to obtain a word set, a word vector generating tool is adopted for training to obtain a word vector of each word, and the word vector set is formed;
providing a corpus sample set S1, wherein the set S1 comprises a plurality of sentence samples Si, i is a natural number which is greater than or equal to 1, and performing word segmentation on the sentence samples Si to obtain a word segmentation result Wij, and the word segmentation result Wij is used for obtaining a word set S2.
And performing unsupervised word vector training on the word set S2 to obtain a word vector set Wij_vec. The word vector has good semantic characteristics and is a common way of representing word characteristics. The value of each dimension of the word vector represents a feature that has some semantic and grammatical interpretation, so each dimension of the word vector may be referred to as a word feature. In this embodiment, word2vec model, which is a software tool for training Word vectors that is open by Google corporation in 2013, may be used to train Word vectors in a Word set. According to a given corpus, a word is quickly and effectively expressed into a vector form through an optimized training model.
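A hedged sketch of the S100 pipeline up to the word set. Whitespace splitting stands in here for a real Chinese word segmenter (e.g. jieba), and the corpus, `corpus_s1`, and `word_set_s2` names are invented; a tool such as Word2vec would then be trained on the segmented corpus to produce the word vector set Wij_vec.

```python
# S100 sketch: segment each sentence sample in S1 and collect the word set S2.

corpus_s1 = [
    "i like eating apples",
    "i like apple phones",
]
segmented = [sentence.split() for sentence in corpus_s1]        # Wij per sentence
word_set_s2 = sorted({w for sent in segmented for w in sent})   # deduplicated S2
print(word_set_s2)
```

The resulting word set is the vocabulary over which word vectors are trained.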
S200: average the word vectors of the sentence to be calculated, using the word vector set, to obtain the sentence vector of the sentence to be calculated;
First, the sentence to be calculated is segmented into words; the word vectors of its constituent words are then extracted from the word vector set Wij_vec and averaged to obtain the sentence vector of the sentence to be calculated.
Specifically, let W1 be the sentence vector corresponding to the sentence to be calculated: W1 = (1/k) · sum(Wij_vec), where j ranges from 1 to k and k is the number of constituent words in the sentence to be calculated.
S300: find, in the word set, the several words most similar to each constituent word of the sentence to be calculated, form a candidate set for each constituent word, and train word vectors for the words of each candidate set;
Word similarity quantifies the complex relations among words and is a quantitative measure of their degree of semantic similarity. The calculation proceeds as follows:
the similarity between each word in the word set S2 and each constituent word of the sentence to be calculated is computed through a proximity algorithm;
In an embodiment of the invention, either the common cosine-similarity formula or the Euclidean-distance formula may be chosen as the proximity algorithm. Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance measures, it focuses on the difference in direction between the two vectors. Typically, once vector representations of two texts have been obtained through an embedding method, their similarity can be computed with cosine similarity. Euclidean distance is the most common distance measure: it is the true distance between two points in multidimensional space, or the natural length of a vector (its distance from the origin); in two- and three-dimensional space it is the actual distance between two points.
For example, in "I like eating apples" and "I like Apple phones", the word "apple" belongs to the fruit domain in one sentence sample and the mobile-phone domain in the other. Computing neighbor-word similarity resolves this: if "apple" has higher word-vector similarity with fruit words such as "orange" and "banana", it belongs to the fruit domain; if it is more similar to phone brands such as "Samsung" and "Xiaomi", it belongs to the phone domain.
Ranked by similarity, the several most similar words for each constituent word are selected to form the candidate sets TSet, whose number equals the number of constituent words of the sentence to be calculated. The Word2vec model is then used to train the word vectors of each candidate set TSet.
S400: calculate the distance between each word in the candidate sets and the sentence vector, use each distance as a weight to multiply the corresponding candidate word vector, and obtain the semantic vector of the sentence to be calculated;
This step relies on the step-S300 similarity results between each word in the candidate sets TSet and the constituent words of the sentence to be calculated. The distances between each candidate word and the sentence vector W1 are computed; each distance, taken as a weight, multiplies the corresponding candidate word vector in TSet; and averaging the products yields the semantic vector of the sentence to be calculated.
For example, when computing the semantic vector of "I like eating apples", the vectors of "apple"'s neighbor words "orange" and "banana" are introduced into the sentence semantics, each weighted by its distance to the sentence vector. This makes the sentence's semantic-vector representation more reasonable and avoids errors.
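The "I like eating apples" example can be walked through end to end as a toy sketch. All vectors, names, and the choice of cosine similarity as the distance are illustrative assumptions; real vectors would come from a trained model such as Word2vec.

```python
import math

# Toy walk-through of steps S200-S400 for "i like eating apples".
word_vecs = {
    "i":       [0.10, 0.00, 0.20],
    "like":    [0.30, 0.40, 0.10],
    "eating":  [0.20, 0.30, 0.30],
    "apples":  [0.50, 0.20, 0.90],
    "orange":  [0.50, 0.25, 0.85],
    "banana":  [0.45, 0.20, 0.95],
    "samsung": [0.90, 0.80, 0.10],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(vecs):
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

sentence = ["i", "like", "eating", "apples"]
sent_vec = mean([word_vecs[w] for w in sentence])             # S200: sentence vector W1

# S300: the 2 nearest neighbours of "apples" among the remaining vocabulary.
neighbours = sorted((w for w in word_vecs if w not in sentence),
                    key=lambda w: cosine(word_vecs[w], word_vecs["apples"]),
                    reverse=True)[:2]

# S400: similarity to the sentence vector as weight, then average.
weighted = [[cosine(word_vecs[w], sent_vec) * x for x in word_vecs[w]]
            for w in neighbours]
semantic_vec = mean(weighted)
print(neighbours, semantic_vec)
```

With these toy vectors, the fruit words "orange" and "banana" are selected ahead of "samsung", and the final semantic vector blends their weighted vectors, mirroring the example in the text.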
In summary, the invention calculates sentence semantic vectors by combining the word vectors of the neighbor words of each constituent word in the sentence. It makes full use of the semantic information of all words, is simple to implement, yields a more reasonable representation, mitigates incorrect sentence semantics caused by wrong constituent words or expressions, and provides useful support for downstream applications.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims (2)

1. A method of computing a semantic vector for a sentence, comprising the steps of:
A. segmenting each sentence sample in a corpus into words to obtain a word set, and training a word-vector generation tool to obtain a word vector for each word, forming a word vector set;
B. averaging, via the word vector set, the word vectors of a sentence to be calculated to obtain the sentence vector of the sentence to be calculated;
C. calculating, via a proximity algorithm, the similarity between each word in the word set and each constituent word of the sentence to be calculated; selecting the several words with the highest similarity to form candidate sets, the number of candidate sets being equal to the number of constituent words of the sentence to be calculated; and selecting the word vector of each candidate word from the word vector set to form the candidate sets' word vector set;
D. calculating the distance between each word in the candidate sets and the sentence vector; multiplying each distance by the word vector of the corresponding candidate word; and averaging the products to obtain the semantic vector of the sentence to be calculated.
2. The method according to claim 1, wherein step B comprises:
extracting the word vector of each constituent word of the sentence to be calculated from the word vector set; and
averaging the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
CN201811348612.4A 2018-11-13 2018-11-13 Method for calculating sentence semantic vector Active CN111191024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector


Publications (2)

Publication Number Publication Date
CN111191024A CN111191024A (en) 2020-05-22
CN111191024B true CN111191024B (en) 2023-06-23

Family

ID=70705086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811348612.4A Active CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Country Status (1)

Country Link
CN (1) CN111191024B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008186A (en) * 2014-06-11 2014-08-27 北京京东尚科信息技术有限公司 Method and device for determining keywords in target text
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sun Zhiyuan; Wang Wei; Ma Di; Mao Wei. A text-similarity calculation method for the mobile-marketing domain. Computer Applications, Supplement S1, full text. *
Huang Jiangping; Ji Donghong. Research on paraphrase recognition based on sentence semantic distance. Journal of Sichuan University (Engineering Science Edition), No. 6, full text. *

Also Published As

Publication number Publication date
CN111191024A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108073568B (en) Keyword extraction method and device
Colombo et al. Automatic text evaluation through the lens of Wasserstein barycenters
WO2017162134A1 (en) Electronic device and method for text processing
TW202009749A (en) Human-machine dialog method, device, electronic apparatus and computer readable medium
CN106855853A (en) Entity relation extraction system based on deep neural network
CN108269125B (en) Comment information quality evaluation method and system and comment information processing method and system
Chen et al. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding
Yoshino et al. Dialogue state tracking using long short term memory neural networks
CN113326374A (en) Short text emotion classification method and system based on feature enhancement
CN110347833B (en) Classification method for multi-round conversations
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network
CN110019556A (en) A kind of topic news acquisition methods, device and its equipment
CN111191024B (en) Method for calculating sentence semantic vector
Visser et al. Sentiment and intent classification of in-text citations using bert
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
Song et al. Hyperrank: hyperbolic ranking model for unsupervised keyphrase extraction
JP5916016B2 (en) Synonym determination device, synonym learning device, and program
Ling Coronavirus public sentiment analysis with BERT deep learning
Jayawickrama et al. Seeking sinhala sentiment: Predicting facebook reactions of sinhala posts
CN113190681A (en) Fine-grained text classification method based on capsule network mask memory attention
Putra et al. Analyzing sentiments on official online lending platform in Indonesia with a Combination of Naive Bayes and Lexicon Based Method
JP4314271B2 (en) Inter-word relevance calculation device, inter-word relevance calculation method, inter-word relevance calculation program, and recording medium recording the program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant