CN110347796A - Short text similarity calculation method in vector semantic tensor space - Google Patents

Short text similarity calculation method in vector semantic tensor space

Info

Publication number
CN110347796A
CN110347796A CN201910607928.9A
Authority
CN
China
Prior art keywords
word
semantic
vector
document
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910607928.9A
Other languages
Chinese (zh)
Inventor
Li Min (李民)
Chen Long (陈龙)
Shan Yingzhe (单英哲)
Cui Haonan (崔豪楠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN201910607928.9A priority Critical patent/CN110347796A/en
Publication of CN110347796A publication Critical patent/CN110347796A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a short text similarity calculation method in a vector semantic tensor space. The method first performs natural language processing, statistical analysis and vector analysis on an input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; it then performs semantic regression prediction on the statistical data set obtained through the natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; finally, a tensor space is established based on the vector-distance and semantic-distance distributions, and the similarity between short texts is calculated in the established tensor space. The invention solves the problem of short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.

Description

Short text similarity calculation method in vector semantic tensor space
Technical field
The present invention relates to a short text similarity calculation method, in particular to a short text similarity calculation method in a vector semantic tensor space, and belongs to the field of artificial intelligence.
Background technique
The core of an intelligent dialogue system is the accuracy of semantic understanding: the higher the accuracy, the more precise the service and the better the user experience. Chinese natural-language semantic understanding is difficult in two main respects. First, language is a human abstraction of objective things and events, so it is subjective and variable, which is especially evident in Chinese processing; second, the information conveyed by language is often context-dependent. Meanwhile, the information sparsity and wording randomness of short texts further increase the difficulty of short-text analysis and processing, which also shows that research on short-text analysis has very important theoretical significance.
The prior art uses Word2Vec to realize text vectorization; Word2Vec is demanding on the quantity and quality of training data and on the completeness of domain knowledge. In practice, however, industry-specific data are usually insufficient, and high-quality data samples generally cannot be obtained through third-party channels, so a high-quality vectorized text representation cannot be obtained in an actual production environment or under small-sample scenarios, which in turn affects the accuracy of short-text similarity.
Summary of the invention
The technical problem to be solved by the present invention is to provide a short text similarity calculation method in a vector semantic tensor space, which solves the problem of short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.
To solve the above problem, the technical solution adopted by the present invention is a short text similarity calculation method in a vector semantic tensor space, comprising the following steps: S01), performing natural language processing, statistical analysis and vector analysis on the input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; S02), performing semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; S03), establishing a tensor space based on the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, the values of the vector distance and the semantic distance lying in [0, 1], and then calculating the similarity between short texts in the established tensor space.
Further, in step S02, semantic analysis prediction is carried out by a semantic regression model. The specific process of the semantic analysis prediction is: S51), first initialize the semantic regression model, i.e., obtain the entity weight set WordIDF of all training samples and the corresponding semantic code set in the Chinese thesaurus (Tongyici Cilin); then extract the Cilin codes of all entity words in the user query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model. In the model, Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96. n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries. S52), traverse the training-sample entity words and the query-sentence entity words, calculate the minimum semantic distance between words, and output the mean of the minimum semantic distances of the query-sentence entity words as the semantic distance dist_sem.
Further, when the code branches of word_i and word_j are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if the code ends with "#".
Further, in step S03, the similarity between short texts is characterized by the Minkowski distance, with the formula: dist = (dist_vec^p + dist_sem^p)^(1/p) (11), where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient.
Further, the value of p is determined by cross-validation, p being equal to 1 or 2.
Further, the statistical analysis includes a training process and a prediction process. The training process completes the weight calculation of the entity words in the training word set; the prediction process calculates the weight value of each entity word when the user inputs a query sentence. The training process is specifically: S31), successively perform duplicate-deletion and synonym-merging on the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list; S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the training word set to the problems, finally obtaining the entity word weight set WordIDF; the document keyword matrix DocList and the entity word weight set WordIDF constitute the statistical data set obtained by the statistical analysis. The prediction process extracts, when the user inputs a query sentence, the weights of the entity words of the query sentence from the WordIDF set.
Further, the weight of each entity word in the training word set is calculated with the IDF algorithm, the specific formula being: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)) (1), where N is the total number of problems in the training word set, n_i is the number of problems containing word_i, and 0.5 is a harmonic coefficient.
Further, the vector analysis includes a training process and a prediction process. The training process completes the calculation of the relevance between the entity word weights of the training word set and the documents and the generation of the document vectors; the prediction process calculates the vector distance between the query sentence and the training samples when the user inputs a query sentence. The training process is specifically: S41), traverse DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document; the calculation formulas are: WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2), K = k1·(1 - b + b·Ratl_j) (3), R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4), where b, k1, k2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query. The relative document length is calculated as follows: first compute the average length Avgl of each document, then compute the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document: Avgl_i = Len(d_i)/Num(d_i) (6), Ratl_i = Avgl_i/((1/N)·Σ_{j=1}^{N} Avgl_j) (7), where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set. S42), traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector space dimension to M; choose the first M records of WordIDF as the entity keywords for constructing the vector space and initialize the word vectors V_word with one-hot encoding; then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights: S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8). The prediction process is specifically: after the user input completes structured analysis, statistical analysis is performed, the intersection of the query sentence and the vector-space entity keywords is obtained, the vectorized representation of the query sentence is calculated based on the entity word vectors and the word-document relevance matrix, the cosine distance between the query sentence vector and the training sample vectors is then calculated and normalized, and the normalized result is output as the vector distance dist_vec.
Further, the natural language processing performs entity information extraction and multi-dimensional label annotation on the user input and outputs structured text. The entity information extraction uses basic natural language processing techniques to segment the natural language; after segmentation, high-frequency meaningless words are filtered, specific entity information vocabulary and interrogative words are retained, and word expansion and normalization are then performed based on the word list obtained from the segmentation. The multi-dimensional label annotation is part-of-speech labeling of the entity set based on a user dictionary; the grammatical relations between key pieces of information are obtained through syntactic analysis, completing the POS tagging and POS correction of the entity words.
Further, the input training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and a number of similar questions; the test sample is a text question input by the user.
Beneficial effects of the present invention: the present invention addresses short-text similarity calculation under small-sample conditions. Compared with word embedding models, the probabilistic model underlying the short text similarity calculation method in the vector semantic tensor space is not demanding on sample size and quality, and thus has better practicability and applicability in actual production practice. Compared with a single probabilistic model, semantic distance is introduced for distance correction, improving the computational accuracy of the model. Compared with traditional pure semantic models, a mathematical model is introduced to reduce complexity and improve generalization ability. Compared with the traditional vector space model, the weight coefficient b is introduced to control the influence of document length on text similarity, and the weight coefficient k1 is introduced to modulate the influence of different word frequencies on text similarity, improving the accuracy of short-text similarity.
Detailed description of the invention
Fig. 1 is a flow chart of the method.
Specific embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
Embodiment 1
This embodiment discloses a short text similarity calculation method in a vector semantic tensor space which, as shown in Fig. 1, comprises the following steps:
S01), the training dataset or test sample is obtained via the input module.
The training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and several similar questions. The test sample is a text question input by the user.
The training dataset is not limited by source; sources include disk documents, structured and unstructured databases, web pages, forums, etc. Data from different sources are standardized in the input module and processed into the form of one typical question and multiple similar questions. The sentences are also manually expanded semantically to enhance the generalization ability of the training set, and standardized text is output. The standardized text consists of the problem set and the corresponding sets of similar questions.
S02), the natural language processing module processes the standardized text, completes entity information extraction and multi-dimensional label annotation, and outputs structured text.
The entity extraction uses basic natural language processing techniques to segment the natural language. After segmentation, high-frequency meaningless words (for example "我" (I) and punctuation marks) are filtered, and specific entity information vocabulary and interrogative words are retained, finally yielding the word set of the natural language text; word expansion and normalization are then performed based on the word list obtained from segmentation. Specific methods include, but are not limited to: efficient word-graph scanning based on a Trie structure, generating the directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in a sentence; searching the maximum-probability path with dynamic programming to find the best segmentation combination based on word frequency; and, for out-of-vocabulary words, using a character-based HMM model with the Viterbi algorithm to find the most probable combination.
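The segmentation pipeline just described (Trie-based word-graph scan, DAG of candidate words, maximum-probability path by dynamic programming, HMM plus Viterbi for unlisted words) is the algorithm implemented by the open-source tokenizer jieba, so a minimal sketch of this preprocessing step could look as follows; the stopword list and the part-of-speech filter are illustrative assumptions, not taken from the patent:

```python
# Sketch of the entity-extraction step; assumes jieba as the tokenizer.
import jieba.posseg as pseg

# Hypothetical high-frequency noise words ("我"/I, punctuation, etc.)
STOPWORDS = {"我", "的", "了", "吗", "，", "。", "？"}

def extract_entities(sentence):
    """Segment a sentence, drop noise words, keep candidate entity/query words."""
    kept = []
    for token in pseg.cut(sentence):      # DAG + max-probability path + HMM internally
        if token.word in STOPWORDS:
            continue
        # keep nouns, verbs and pronouns/interrogatives as candidate entity words
        if token.flag and token.flag[0] in ("n", "v", "r"):
            kept.append((token.word, token.flag))
    return kept

print(extract_entities("如何办理身份证挂失？"))
```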
The multi-dimensional label annotation refers to part-of-speech labeling of the entity set based on a user dictionary, using syntactic analysis techniques (such as finite-state analysis, phrase-structure analysis, full grammar, local grammar and dependency analysis) to obtain the grammatical dependencies, modification relations, etc. between key pieces of information, completing the POS tagging and POS correction of the entity words.
In this embodiment, the structured text output by the natural language processing module includes the entity word set of the natural language text and the corresponding part-of-speech set.
S03), the statistical analysis module performs statistical analysis on the structured text output by the natural language processing module, including a training process and a prediction process. The training process mainly completes the weight calculation of the entity words in the training word set; the prediction process calculates the weight value of each entity word in the query sentence when the user inputs a query.
In the present embodiment, the training process of statistical analysis is as follows:
S31), for the structured text, construct the training word set and the document keyword matrix DocList by successively deleting duplicate items and merging synonyms, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list.
S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the word set to the problems. There are many ways to calculate the weights; this embodiment uses the IDF algorithm and finally obtains the term weight set WordIDF.
The calculation formula is: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)) (1), where N is the total number of problems in the training set, n_i is the number of problems containing word_i, 0.5 is a harmonic coefficient, and the log function makes the IDF value respond more smoothly to N and n_i.
The statistical data set finally output by the training consists of the term weight set WordIDF and the document keyword matrix DocList.
The prediction process of the statistical analysis is: when the user inputs a query sentence, the entity weights are extracted from the WordIDF set according to the entity words in the query sentence.
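Taking formula (1) in the smoothed form reconstructed above (an assumption about the lost formula image), the training and prediction halves of this statistics module can be sketched as:

```python
import math

def build_word_idf(doc_list):
    """Training (S31-S32): doc_list maps each standard question to its entity
    word list; returns WordIDF, the weight of each word per formula (1)."""
    n_docs = len(doc_list)
    df = {}                                  # n_i: number of problems containing word_i
    for words in doc_list.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    return {w: math.log((n_docs - n + 0.5) / (n + 0.5)) for w, n in df.items()}

def query_weights(query_words, word_idf):
    """Prediction: look up the WordIDF weight of each entity word in the query."""
    return {w: word_idf[w] for w in query_words if w in word_idf}
```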
S04), the vector analysis module performs vector analysis on the data output by the statistical analysis module, including a training process and a prediction process. The training process mainly completes the calculation of the relevance between the training-set entity word weights and the documents and the generation of the document vectors. The prediction process calculates the vector distance between the query and the training samples when the user inputs a query sentence.
The training process of the vector analysis is as follows: S41), traverse the training document keyword matrix DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document. The calculation formulas are:

WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2),

K = k1·(1 - b + b·Ratl_j) (3),

R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4),

where b, k1, k2 are regulatory factors, generally set empirically (typically b = 0.75, k1 = 2); f_i is the frequency of word_i in document d_j; Ratl_j is the relative length of document d_j; and qf_i is the frequency of word_i in the query. In the vast majority of cases word_i appears only once in the query, i.e., qf_i = 1, so formula (4) simplifies to:

R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) (5).

The relative length Ratl_j is calculated as follows: first compute the average length Avgl of each document, then compute the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document; the relative length is used to correct the influence of document length on short-text similarity. The calculation formulas are:

Avgl_i = Len(d_i)/Num(d_i) (6),

Ratl_i = Avgl_i / ((1/N)·Σ_{j=1}^{N} Avgl_j) (7),

where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set.
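Under the BM25-style reading of formulas (2)-(7) reconstructed above (the formula images are lost from this record, so the exact forms are assumptions), the relevance computation could be sketched as:

```python
def relative_lengths(doc_sentences):
    """Formulas (6)-(7): doc_sentences maps doc id -> list of its sentences."""
    avgl = {d: sum(len(s) for s in sents) / len(sents)
            for d, sents in doc_sentences.items()}
    mean_avgl = sum(avgl.values()) / len(avgl)
    return {d: a / mean_avgl for d, a in avgl.items()}   # Ratl per document

def word_doc_coef(word, doc, idf, freq, ratl, b=0.75, k1=2.0):
    """Formulas (2)-(5) with qf_i = 1: relevance of `word` to document `doc`."""
    f_i = freq[doc].get(word, 0)             # occurrences of word in the document
    K = k1 * (1 - b + b * ratl[doc])         # length correction, formula (3)
    r = f_i * (k1 + 1) / (f_i + K)           # simplified R, formula (5)
    return idf.get(word, 0.0) * r            # formula (2)
```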
S42), traverse the document word matrix DocList to obtain the document vectors. First sort WordIDF in descending order of weight and set the vector space dimension to M; choose the first M records of WordIDF as the entity keywords for constructing the vector space and initialize the word vectors V_word with one-hot encoding (the word's own dimension is 1, all others 0). Then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights. The formula is:

S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8).
In this embodiment, the prediction process of the vector analysis is as follows: after the user input has completed structured analysis, statistical analysis is performed to obtain the intersection of the query sentence and the vector-space entity keywords; the vectorized representation of the query sentence is computed from the entity word vectors and the word-document relevance matrix; the cosine distance between the query sentence vector and each training sample vector is then calculated and normalized, and finally the normalized result is output as the vector distance dist_vec.
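Formula (8) and the prediction step can then be sketched as follows; mapping the cosine similarity into a [0, 1] distance via (1 - cos)/2 is an assumption, since the patent only states that the cosine distance is normalized:

```python
import numpy as np

def doc_vector(entity_words, doc, vocab_index, coef):
    """Formula (8): weighted sum of one-hot word vectors for one document."""
    v = np.zeros(len(vocab_index))
    for w in entity_words:
        if w in vocab_index:
            v[vocab_index[w]] = coef(w, doc)   # WordDocCoef(word_i, d_j)
    return v

def vector_distance(q_vec, d_vec):
    """Cosine distance between query and sample vectors, normalized into [0, 1]."""
    denom = np.linalg.norm(q_vec) * np.linalg.norm(d_vec)
    cos = (q_vec @ d_vec) / denom if denom else 0.0
    return (1.0 - cos) / 2.0                   # assumed normalization into [0, 1]
```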
S05), the semantic regression module performs semantic regression prediction; the prediction process calculates the semantic distance between the user's input query sentence and the training samples.
In this embodiment, the prediction process of the semantic regression model is as follows: S51), first initialize the semantic regression model, i.e., obtain, for all training-sample entity words in the weight set WordIDF, their semantic code set in the Chinese thesaurus (Tongyici Cilin); then extract the Cilin codes of all entity words in the query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model.

In the model, Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96. n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries. When the code branches are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if it ends with "#".
The node total n of the branch layer where a word is located and the branch distance k between non-synonymous words both influence the similarity of the sense entries. For example:
Aa01A01= 人士 人物 人氏 人选 (personage, figure, person, candidate)
Aa01A02= 人类 生人 全人类 (mankind, stranger, all mankind)
Aa01A03= 人手 人员 人口 人丁 口 食指 (manpower, personnel, population, household member, mouth, index finger)
Aa01A04= 劳力 劳动力 工作者 (labor, labor force, worker)
Aa01A05= 匹夫 个人 (ordinary man, individual)
人士 (personage) and 人员 (personnel) lie in the same branch Aa01A; there are 5 codes in total beginning with Aa01A, so n = 5. The branch level at which 人士 and 人员 are located is level 5 (each pair of digits in the code represents one level), so weight = 0.96. The branch distance between 人士 (Aa01A01) and 人员 (Aa01A03) is 2, so k = 2. Substituting weight, n and k into the semantic regression model yields the semantic distance between the two words.
S52), traverse the training-sample entity words and the query-sentence entity words and calculate the minimum semantic distance between words; finally, the mean of the minimum semantic distances of the query-sentence entity words is output as the semantic distance dist_sem.
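The weight table and the roles of n and k are stated in the text, but the combining formula itself was an image lost from this record. The sketch below assumes the Cilin-style combination Score = 1 - weight · cos(n·π/180) · (n - k + 1)/n, modeled on the published Tongyici-Cilin similarity literature the patent cites, and a hypothetical pair_score callable for step S52:

```python
import math

WEIGHT_BY_LEVEL = {1: 0.00, 2: 0.65, 3: 0.80, 4: 0.90, 5: 0.96}

def split_level(code_a, code_b):
    """Level (1-5) at which two 7-character Cilin codes first differ, else None."""
    for level, end in enumerate((1, 2, 4, 5, 7), start=1):
        if code_a[:end] != code_b[:end]:
            return level
    return None

def score(code_a, code_b, n, k):
    """Semantic distance between two words from their Cilin codes; n is the node
    count of the branch layer, k the branch distance between the sense entries."""
    level = split_level(code_a[:7], code_b[:7])
    if level is None:                      # identical branch: '=' synonym, '#' related
        return 0.0 if code_a.endswith("=") else 0.5
    w = WEIGHT_BY_LEVEL[level]
    # assumed combination of weight, n and k (not verbatim from the patent)
    return 1.0 - w * math.cos(n * math.pi / 180) * (n - k + 1) / n

def dist_sem(query_words, sample_words, pair_score):
    """S52: mean over query words of the minimum distance to any sample word."""
    return sum(min(pair_score(q, s) for s in sample_words)
               for q in query_words) / len(query_words)

# worked example from the description: 人士 (Aa01A01=) vs 人员 (Aa01A03=), n=5, k=2
print(round(score("Aa01A01=", "Aa01A03=", n=5, k=2), 3))
```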
S06), the similarity calculation module calculates the short-text similarity in the tensor space.
A tensor space is first established based on the sentence vector-distance distribution and the semantic-distance distribution, where the values of both the vector distance and the semantic distance lie in [0, 1]. The vector distance between sentences characterizes their semantics from a statistical perspective, while the semantic distance represents the differences between sentences from a semantic perspective; the two distances of different dimensions complement each other, so, following the idea of ensemble learning, both are placed in the same tensor space for comprehensive consideration. The final similarity between short texts is characterized by the Minkowski distance in the tensor space. The formula is:

dist = (dist_vec^p + dist_sem^p)^(1/p) (11),

where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient. In practical applications, the value of p can be determined by cross-validation according to the specific business data; the classical values are 1 and 2. When p = 1 the distance is the city-block (Manhattan) distance; when p = 2 it is the classical Euclidean distance.
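A minimal sketch of formula (11); whether similarity is then reported as 1 - dist or via another monotone mapping is left implicit by the patent:

```python
def minkowski(dist_vec, dist_sem, p=2):
    """Formula (11): combined distance in the vector-semantic tensor space."""
    return (dist_vec ** p + dist_sem ** p) ** (1.0 / p)

print(minkowski(0.30, 0.24, p=1))   # Manhattan / city-block combination: 0.54
print(minkowski(0.30, 0.24, p=2))   # Euclidean combination: ~0.384
```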
The above describes only the basic principle and a preferred embodiment of the present invention; improvements and substitutions made by those skilled in the art according to the present invention fall within the protection scope of the present invention.

Claims (10)

1. A short text similarity calculation method in a vector semantic tensor space, characterized by comprising the following steps: S01), performing natural language processing, statistical analysis and vector analysis on an input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; S02), performing semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; S03), establishing a tensor space based on the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, the values of the vector distance and the semantic distance lying in [0, 1], and then calculating the similarity between short texts in the established tensor space.
2. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: in step S02, semantic analysis prediction is carried out by a semantic regression model, and the specific process of the semantic analysis prediction is: S51), first initialize the semantic regression model, i.e., obtain the entity weight set WordIDF of all training samples and the corresponding semantic code set in the Chinese thesaurus; then extract the Chinese-thesaurus codes of all entity words in the user query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model, wherein Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96; n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries; S52), traverse the training-sample entity words and the query-sentence entity words, calculate the minimum semantic distance between words, and output the mean of the minimum semantic distances of the query-sentence entity words as the semantic distance dist_sem.
3. The short text similarity calculation method in a vector semantic tensor space according to claim 2, characterized in that: when the code branches of word_i and word_j are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if the code ends with "#".
4. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: in step S03, the similarity between short texts is characterized by the Minkowski distance, with the formula: dist = (dist_vec^p + dist_sem^p)^(1/p), where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient.
5. The short text similarity calculation method in a vector semantic tensor space according to claim 4, characterized in that: the value of p is determined by cross-validation, p being equal to 1 or 2.
6. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the statistical analysis comprises a training process and a prediction process; the training process completes the weight calculation of the entity words in the training word set, and the prediction process calculates the weight value of each entity word when the user inputs a query sentence; the training process is specifically: S31), successively perform duplicate-deletion and synonym-merging on the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list; S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the training word set to the problems, finally obtaining the entity word weight set WordIDF; the document keyword matrix DocList and the entity word weight set WordIDF constitute the statistical data set obtained by the statistical analysis; the prediction process is: when the user inputs a query sentence, the entity weights are extracted from the WordIDF set according to the entity words in the query sentence.
7. The short text similarity calculation method in a vector semantic tensor space according to claim 6, characterized in that: the weight of each entity word in the training word set is calculated with the IDF algorithm, the specific formula being: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)), where N is the total number of problems in the training word set, n_i is the number of problems containing word_i, and 0.5 is a harmonic coefficient.
8. The short text similarity calculation method in a vector semantic tensor space according to claim 7, characterized in that: the vector analysis comprises a training process and a prediction process; the training process completes the calculation of the relevance between the entity word weights of the training word set and the documents and the generation of the document vectors, and the prediction process calculates the vector distance between the query sentence and the training samples when the user inputs a query sentence; the training process is specifically: S41), traverse DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document, the calculation formulas being: WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2), K = k1·(1 - b + b·Ratl_j) (3), R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4), where b, k1, k2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query; the relative document length is calculated by first computing the average length Avgl of each document and then computing the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document: Avgl_i = Len(d_i)/Num(d_i) (6), Ratl_i = Avgl_i/((1/N)·Σ_{j=1}^{N} Avgl_j) (7), where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set; S42), traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector space dimension to M, choose the first M records of WordIDF as the entity keywords for constructing the vector space, and initialize the word vectors V_word with one-hot encoding; then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights, with the formula: S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8); the prediction process is specifically: after the user input completes structured analysis, statistical analysis is performed, the intersection of the query sentence and the vector-space entity keywords is obtained, the vectorized representation of the query sentence is calculated based on the entity word vectors and the word-document relevance matrix, the cosine distance between the query sentence vector and the training sample vectors is then calculated and normalized, and the normalized result is output as the vector distance dist_vec.
9. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the natural language processing performs entity information extraction and multi-dimensional label annotation on the user input and outputs structured text; the entity information extraction uses basic natural language processing techniques to segment the natural language, filters high-frequency meaningless words after segmentation, retains specific entity information vocabulary and interrogative words, and then performs word expansion and normalization based on the word list obtained from the segmentation; the multi-dimensional label annotation is part-of-speech labeling of the entity set based on a user dictionary, obtaining the grammatical relations between key pieces of information through syntactic analysis and completing the POS tagging and POS correction of the entity words.
10. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the input training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and a number of similar questions; the test sample is a text question input by the user.
CN201910607928.9A 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space Pending CN110347796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910607928.9A CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910607928.9A CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Publications (1)

Publication Number Publication Date
CN110347796A true CN110347796A (en) 2019-10-18

Family

ID=68177949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910607928.9A Pending CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Country Status (1)

Country Link
CN (1) CN110347796A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
US20170011279A1 (en) * 2015-07-07 2017-01-12 Xerox Corporation Latent embeddings for word images and their semantics
JP2017162112A (en) * 2016-03-08 2017-09-14 日本電信電話株式会社 Word extraction device, method and program
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 A kind of short text similarity calculating method based on probabilistic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Fangfang et al., "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science) *
PIAO Yong et al., "Tensor-based XML similarity calculation method", Control and Decision *
CHEN Hongchao et al., "Word similarity calculation of Tongyici Cilin based on path and depth", Journal of Chinese Information Processing *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144129A (en) * 2019-12-26 2020-05-12 成都航天科工大数据研究院有限公司 Semantic similarity obtaining method based on autoregression and self-coding
CN111144129B (en) * 2019-12-26 2023-06-06 成都航天科工大数据研究院有限公司 Semantic similarity acquisition method based on autoregressive and autoencoding
CN111460556A (en) * 2020-04-01 2020-07-28 上海建工四建集团有限公司 Method and device for determining relevance between drawings, storage medium and terminal
CN112131341A (en) * 2020-08-24 2020-12-25 博锐尚格科技股份有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN112287080A (en) * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Question sentence rewriting method and device, computer equipment and storage medium
CN112287080B (en) * 2020-10-23 2023-10-03 平安科技(深圳)有限公司 Method and device for rewriting problem statement, computer device and storage medium
CN114254090A (en) * 2021-12-08 2022-03-29 马上消费金融股份有限公司 Question-answer knowledge base expansion method and device
CN114462413A (en) * 2022-02-16 2022-05-10 平安科技(深圳)有限公司 User entity matching method and device, computer equipment and readable storage medium
CN114462413B (en) * 2022-02-16 2023-06-23 平安科技(深圳)有限公司 User entity matching method, device, computer equipment and readable storage medium
CN116521875A (en) * 2023-05-09 2023-08-01 江南大学 Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection
CN116521875B (en) * 2023-05-09 2023-10-31 江南大学 Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection

Similar Documents

Publication Publication Date Title
CN109858028B (en) Short text similarity calculation method based on probability model
CN110347796A (en) Short text similarity calculation method in vector semantic tensor space
Chieu et al. A maximum entropy approach to information extraction from semi-structured and free text
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
CN110674252A (en) High-precision semantic search system for judicial domain
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN109783806A (en) A kind of text matching technique using semantic analytic structure
US20220207240A1 (en) System and method for analyzing similarity of natural language data
US20200073890A1 (en) Intelligent search platforms
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN117474703B (en) Topic intelligent recommendation method based on social network
CN114997288B (en) Design resource association method
Suleiman et al. Bag-of-concept based keyword extraction from Arabic documents
Jbara et al. Knowledge discovery in Al-Hadith using text classification algorithm
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
Zhao et al. Keyword extraction for social media short text
CN114265936A (en) Method for realizing text mining of science and technology project
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN110990003A (en) API recommendation method based on word embedding technology
Mansour et al. Text vectorization method based on concept mining using clustering techniques
CN114580557A (en) Document similarity determination method and device based on semantic analysis
Dong et al. Knowledge graph construction of high-performance computing learning platform
Su et al. Automatic ontology population using deep learning for triple extraction
MalarSelvi et al. Analysis of Different Approaches for Automatic Text Summarization
Kyjánek et al. Constructing a lexical resource of Russian derivational morphology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191018)