CN110347796A - Short text similarity calculating method under vector semantic tensor space - Google Patents
- Publication number: CN110347796A
- Application number: CN201910607928.9A
- Authority
- CN
- China
- Prior art keywords
- word
- semantic
- vector
- document
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3344—Query execution using natural language analysis (information retrieval of unstructured textual data)
- G06F18/22—Matching criteria, e.g. proximity measures (pattern recognition)
- G06F40/247—Thesauruses; Synonyms (natural language analysis, lexical tools)
- G06F40/295—Named entity recognition (phrasal analysis)
- G06F40/30—Semantic analysis (handling natural language data)
Abstract
The present invention discloses a method for calculating short-text similarity in a vector-semantic tensor space. The method first applies natural language processing, statistical analysis, and vector analysis to an input training data set or test sample to obtain the vector distance between a user query sentence and the training samples. It then performs semantic regression prediction on the statistical data set produced by the natural language processing and statistical analysis, obtaining the semantic distance between the query sentence and the training samples. Finally, a tensor space is built from the vector-distance and semantic-distance distributions, and the similarity between short texts is computed in that tensor space. The invention solves short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.
Description
Technical field
The present invention relates to short-text similarity calculation methods, and specifically to a short-text similarity calculation method in a vector-semantic tensor space; it belongs to the field of artificial intelligence.
Background art
The core of an intelligent dialogue system is the accuracy of its semantic understanding: the higher the accuracy, the more precise the service and the better the user experience. Chinese natural-language semantic understanding is difficult for two main reasons. First, language is a human abstraction of objective things and events, so it is subjective and changeable, which is especially evident when processing Chinese. Second, the information a sentence conveys often depends on its context. The information sparsity and wording randomness of short texts further increase the difficulty of short-text analysis and, at the same time, show that research on short-text analysis has important theoretical significance.
The prior art uses Word2Vec to vectorize text, but Word2Vec is demanding in terms of training-data quantity, quality, and domain-knowledge completeness. In practice, industry-specific data volumes are usually insufficient, and high-quality data samples generally cannot be obtained through third-party channels, so a high-quality vectorized representation of the text cannot be obtained in real production environments or small-sample scenarios, which in turn reduces the accuracy of short-text similarity.
Summary of the invention
The technical problem to be solved by the present invention is to provide a short-text similarity calculation method in a vector-semantic tensor space that solves short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.
To solve this problem, the technical solution adopted by the present invention is a short-text similarity calculation method in a vector-semantic tensor space, comprising the following steps: S01) apply natural language processing, statistical analysis, and vector analysis to the input training data set or test sample to obtain the vector distance between the user query sentence and the training samples; S02) perform semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, obtaining the semantic distance between the user query sentence and the training samples; S03) build a tensor space from the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, where the values of both the vector distance and the semantic distance lie in [0, 1], and then calculate the similarity between short texts in the established tensor space.
Further, in step S02 the semantic analysis prediction is performed by a semantic regression model, as follows: S51) first initialize the semantic regression model, i.e. obtain the weight set WordIDF of all training-sample entity words together with their semantic code sets in the Chinese synonym thesaurus (Tongyici Cilin); then extract the thesaurus codes of all entity words in the user query sentence; then, according to the entity codes and the branches on which their leaf nodes sit, calculate the semantic distance between words with the semantic regression model:

$$Score(word_i, word_j) = 1 - weight \cdot \frac{n - k + 1}{n}$$

where Score(word_i, word_j) denotes the semantic distance between word_i and word_j; the value of weight depends on the level of the code branch at which word_i and word_j diverge: weight = 0.00 when the branches differ at level 1, 0.65 at level 2, 0.80 at level 3, 0.90 at level 4, and 0.96 at level 5; n is the total number of nodes in the branch layer where the words sit; and k is the branch distance between the two semantic items. S52) Traverse the training-sample entity words and the query-sentence entity words, compute the minimum semantic distance between words, and finally output the mean of the query-sentence entity words' minimum semantic distances as the semantic distance dist_sem.
Further, when word_i and word_j have identical code branches, Score = 0.00 if the code ends with '=' and Score = 0.5 if it ends with '#'.
Further, in step S03 the similarity between short texts is characterized by the Minkowski distance:

$$dist = \left( dist_{vec}^{\,p} + dist_{sem}^{\,p} \right)^{\frac{1}{p}} \tag{11}$$

where dist_vec denotes the vector distance, dist_sem the semantic distance, and p the distance coefficient.
Further, the value of p is determined by cross-validation; p equals 1 or 2.
Further, the statistical analysis comprises a training process and a prediction process. The training process computes the weights of the entity words in the training word set; the prediction process computes the weight of each entity word when the user inputs a query sentence. The training process is: S31) apply duplicate deletion and synonym merging, in sequence, to the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, where each record of DocList contains one standard question sentence and its corresponding word list; S32) traverse the training word set and compute each word word_i's contribution to the questions, i.e. its weight, finally obtaining the entity-word weight set WordIDF; the document keyword matrix DocList and the entity-word weight set WordIDF constitute the statistical data set produced by the statistical analysis. The prediction process extracts the entity weights from the WordIDF set according to the entity words in the query sentence when the user inputs a query.
Further, the weight of each entity word in the training word set is calculated with an IDF algorithm:

$$IDF(word_i) = \log\frac{N - n_i + 0.5}{n_i + 0.5} \tag{1}$$

where N is the total number of questions in the training word set, n_i is the number of questions containing word_i, and 0.5 is a smoothing coefficient.
Further, the vector analysis comprises a training process and a prediction process. The training process computes the relevance between the training-word-set entity-word weights and the documents and generates the document vectors; the prediction process computes the vector distance between the user query sentence and the training samples. The training process is: S41) traverse DocList and WordIDF and compute the relevance score between the keywords of each DocList record and the question sentence, obtaining the relevance set WordDocCoef of each word with each document:

$$WordDocCoef(word_i, d_j) = IDF(word_i) \cdot R(word_i, d_j) \tag{2}$$

$$K = k_1 \cdot (1 - b + b \cdot Ratl_j) \tag{3}$$

$$R(word_i, d_j) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2} \tag{4}$$

where b, k_1, k_2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query. The relative length is computed by first computing the average length Avgl of each document and then the ratio of the document's average length to the average length over all documents:

$$Avgl_i = \frac{Len(d_i)}{Num(d_i)} \tag{6}$$

$$Ratl_i = \frac{Avgl_i}{\frac{1}{N}\sum_{j=1}^{N} Avgl_j} \tag{7}$$

where Len(d_i) is the total length of document d_i, Num(d_i) is the number of question sentences in document d_i, and N is the total number of documents in the training set. S42) Traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector-space dimension to M, taking the first M records of WordIDF as the entity keywords that span the vector space and initializing the word vectors \(\vec{w}_i\) with one-hot coding; then traverse DocList to obtain the entity word list of each document, query the relevance of that word list to the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector by weighting:

$$\vec{d}_j = \sum_{word_i \in d_j} WordDocCoef(word_i, d_j) \cdot \vec{w}_i \tag{8}$$

The prediction process is: after the user input completes structured analysis it is statistically analyzed; the intersection of the query sentence with the vector-space entity keywords is obtained; the vectorized representation of the query sentence is computed from the entity word vectors and the word-document relevance matrix; the cosine distance between the query vector and each training-sample vector is then computed and normalized; and the normalized result is output as the vector distance dist_vec.
Further, the natural language processing extracts entity information from the user input, performs multi-dimensional tagging, and outputs structured text. The entity-information extraction uses basic natural-language-processing techniques to segment the text; after segmentation, high-frequency meaningless words are filtered out and specific entity-information words and query words are retained; word expansion and normalization are then applied to the word list obtained from segmentation. The multi-dimensional tagging is part-of-speech labeling of the entity set based on a user dictionary; syntactic analysis is used to obtain the grammatical relations between key pieces of information, completing part-of-speech tagging and correction of the entity words.
Further, the input training data set is a question set for a certain industry or field and contains several records, each record comprising one typical question and several similar questions; the test sample is a text question input by the user.
Beneficial effects of the present invention: the invention addresses short-text similarity calculation under small-sample conditions. Compared with word-embedding models, the short-text similarity calculation method in a vector-semantic tensor space places low demands on sample size and quality through its probabilistic model, and is therefore more practical and applicable in real production. Compared with a single probabilistic model, the introduced semantic distance corrects the vector distance and improves computational accuracy. Compared with traditional purely semantic models, the introduced mathematical model reduces complexity and improves generalization. Compared with traditional vector-space models, the weight coefficient b is introduced to control the influence of document length on text similarity, and the weight coefficient k_1 is introduced to modulate the influence of different word frequencies on text similarity, improving the accuracy of short-text similarity.
Detailed description of the invention
Fig. 1 is the flow chart of this method.
Specific embodiment
The present invention is further illustrated below with reference to the drawings and specific embodiments.
Embodiment 1
This embodiment discloses a short-text similarity calculation method in a vector-semantic tensor space which, as shown in Fig. 1, comprises the following steps:
S01) Obtain the training data set or test sample through the input module.

The training data set is a question set for a certain industry or field and comprises several records; each record contains one typical question and several similar questions. The test sample is a text question input by the user.

The training data set is not limited by source; sources include disk documents, structured and unstructured databases, web pages, forums, etc. The input module standardizes data from different sources, processing it into the form of one typical question plus multiple similar question sentences. At the same time the sentences are manually expanded semantically to enhance the generalization ability of the training set, and standardized text is output. The standardized text consists of the question set and the corresponding sets of similar question sentences.
S02) The natural language processing module processes the standardized text, completing entity-information extraction and multi-dimensional tagging, and outputs structured text.

The entity extraction uses basic natural-language-processing techniques to segment the text. After segmentation, high-frequency meaningless words, such as '我' (I) and punctuation marks, are filtered out, and specific entity-information words and query words are retained, yielding the word set of the natural-language text; word expansion and normalization are then applied to the word list obtained from segmentation. Specific methods include, but are not limited to: efficient word-graph scanning based on a trie structure, generating the directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; searching the maximum-probability path with dynamic programming to find the maximum word-frequency-based segmentation; and, for out-of-vocabulary words, using a character-based HMM model and the Viterbi algorithm to find the most probable combination.
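The segmentation pipeline described here (trie-based word-graph scan, maximum-probability path via dynamic programming, HMM plus Viterbi for unknown words) matches what the open-source jieba segmenter implements. A minimal sketch of the entity-extraction step; jieba and the stopword list are illustrative choices, not named in the patent:

```python
import jieba

# Illustrative high-frequency noise words; the patent's actual filter list is not given.
STOPWORDS = {"我", "的", "了", "吗", "呢", "？", "，", "。"}

def extract_entity_words(sentence: str) -> list:
    """Segment a sentence and keep entity/query words, dropping noise words."""
    return [w for w in jieba.cut(sentence) if w.strip() and w not in STOPWORDS]

print(extract_entity_words("我想查询社保卡的办理流程"))  # e.g. ['想', '查询', '社保卡', '办理', '流程']
```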
The multi-dimensional tagging refers to part-of-speech labeling of the entity set based on a user dictionary. Syntactic analysis techniques, such as finite graph analysis, phrase-structure analysis, full grammar, local grammar, and dependency analysis, are used to obtain the grammatical dependency and modification relations between key pieces of information, completing part-of-speech tagging and correction of the entity words.

In this embodiment, the structured text output by the natural language processing module comprises the entity word set of the natural-language text and the corresponding part-of-speech set.
S03) The statistical analysis module statistically analyzes the structured text output by the natural language processing module, in a training process and a prediction process. The training process mainly computes the weights of the entity words in the training word set; the prediction process computes the weight of each entity word in the query sentence when the user inputs a query.

In this embodiment, the training process of the statistical analysis is as follows:

S31) For the structured text, construct the training word set and the document keyword matrix DocList by successively deleting duplicates and merging synonyms; each record of DocList contains one standard question sentence and its corresponding word list.

S32) Traverse the word set and compute each word word_i's contribution to the questions, i.e. its weight. There are many ways to compute the weight; this embodiment uses an IDF algorithm and finally obtains the word-weight set WordIDF.

The calculation formula is:

$$IDF(word_i) = \log\frac{N - n_i + 0.5}{n_i + 0.5} \tag{1}$$

where N is the total number of questions in the training set, n_i is the number of questions containing word_i, 0.5 is a smoothing coefficient, and the log function makes the IDF value respond more smoothly to N and n_i.

The statistical data set output by training comprises the word-weight set WordIDF and the document keyword matrix DocList.

The prediction process of the statistical analysis extracts the entity weights from the WordIDF set according to the entity words in the query sentence when the user inputs a query.
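A minimal sketch of steps S31-S32; representing DocList as a list of (standard question, entity word list) pairs is an assumption about the data layout:

```python
import math
from collections import Counter

def build_word_idf(doc_list):
    """WordIDF per formula (1): IDF(w) = log((N - n_w + 0.5) / (n_w + 0.5)),
    where N is the number of questions and n_w the number containing word w."""
    N = len(doc_list)
    doc_freq = Counter(w for _, words in doc_list for w in set(words))
    return {w: math.log((N - n + 0.5) / (n + 0.5)) for w, n in doc_freq.items()}

doc_list = [("如何办理社保卡", ["办理", "社保卡"]),
            ("社保卡丢失怎么补办", ["社保卡", "丢失", "补办"]),
            ("如何查询公积金余额", ["查询", "公积金", "余额"])]
word_idf = build_word_idf(doc_list)  # e.g. word_idf["办理"] = log(2.5/1.5) ≈ 0.51
```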
S04) The vector analysis module performs vector analysis on the data output by the statistical analysis module, in a training process and a prediction process. The training process mainly computes the relevance between the training-set entity-word weights and the documents and generates the document vectors; the prediction process computes the vector distance between the query and the training samples when the user inputs a query sentence.

The training process of the vector analysis is as follows: S41) traverse the training-set document keyword matrix DocList and WordIDF and compute the relevance score between the keywords of each DocList record and the question sentence, obtaining the relevance set WordDocCoef of each word with each document:

$$WordDocCoef(word_i, d_j) = IDF(word_i) \cdot R(word_i, d_j) \tag{2}$$

$$K = k_1 \cdot (1 - b + b \cdot Ratl_j) \tag{3}$$

$$R(word_i, d_j) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2} \tag{4}$$

where b, k_1, k_2 are regulatory factors, generally set by experience (typically b = 0.75 and k_1 = 2); f_i is the frequency of word_i in document d_j; Ratl_j is the relative length of document d_j; and qf_i is the frequency of word_i in the query. In the vast majority of cases word_i appears only once in the query, i.e. qf_i = 1, so formula (4) simplifies to:

$$R(word_i, d_j) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \tag{5}$$
The relative length Ratl_j is computed by first computing the average length Avgl of each document and then the ratio of the document's average length to the average length over all documents; the relative length corrects the influence of document length on short-text similarity:

$$Avgl_i = \frac{Len(d_i)}{Num(d_i)} \tag{6}$$

$$Ratl_i = \frac{Avgl_i}{\frac{1}{N}\sum_{j=1}^{N} Avgl_j} \tag{7}$$

where Len(d_i) is the total length of document d_i, Num(d_i) is the number of question sentences in document d_i, and N is the total number of documents in the training set.
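A minimal sketch of formulas (2)-(7), using the defaults b = 0.75 and k_1 = 2 given above and the qf_i = 1 simplification (5):

```python
def relative_lengths(lengths, counts):
    """Formulas (6)-(7): per-document average sentence length divided by the
    mean average length over all documents."""
    avgl = [length / num for length, num in zip(lengths, counts)]
    mean_avgl = sum(avgl) / len(avgl)
    return [a / mean_avgl for a in avgl]

def word_doc_coef(idf_w, f_i, ratl_j, b=0.75, k1=2.0):
    """Formulas (2), (3) and (5): WordDocCoef = IDF(w) * R(w, d_j) with qf_i = 1."""
    K = k1 * (1 - b + b * ratl_j)
    return idf_w * (f_i * (k1 + 1)) / (f_i + K)

ratl = relative_lengths(lengths=[24, 40, 30], counts=[3, 4, 3])
print(word_doc_coef(idf_w=0.51, f_i=2, ratl_j=ratl[0]))
```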
S42) Traverse the document word matrix DocList to obtain the document vectors. First sort WordIDF in descending order of weight and set the vector-space dimension to M, taking the first M records of WordIDF as the entity keywords that span the vector space; initialize the word vectors \(\vec{w}_i\) with one-hot coding, i.e. a single vector dimension is non-zero (equal to 1). Then traverse DocList to obtain the entity word list of each document, query the relevance of that word list to the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector by weighting:

$$\vec{d}_j = \sum_{word_i \in d_j} WordDocCoef(word_i, d_j) \cdot \vec{w}_i \tag{8}$$
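A minimal sketch of step S42; because the word vectors are one-hot, the weighted sum of formula (8) amounts to writing each word's WordDocCoef value into its vocabulary slot. Here coef is assumed to be a callable returning WordDocCoef(word, d_j):

```python
import numpy as np

def build_doc_vectors(doc_list, word_idf, coef, M=1000):
    """Formula (8): span the space with the top-M words by IDF weight; each document
    vector is the WordDocCoef-weighted sum of its words' one-hot vectors."""
    vocab = sorted(word_idf, key=word_idf.get, reverse=True)[:M]
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((len(doc_list), len(vocab)))
    for j, (_, words) in enumerate(doc_list):
        for w in words:
            if w in index:
                vectors[j, index[w]] = coef(w, j)
    return vocab, vectors
```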
In this embodiment, the prediction process of the vector analysis is as follows: after the user input completes structured analysis, it is statistically analyzed; the intersection of the query sentence with the vector-space entity keywords is obtained; the vectorized representation of the query sentence is computed from the entity word vectors and the word-document relevance matrix; the cosine distance between the query vector and each training-sample vector is then computed and normalized; finally, the normalized result is output as the vector distance dist_vec.
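A minimal sketch of this prediction step; the patent says only that the cosine distances are normalized, so the min-max normalization used here is an assumption:

```python
import numpy as np

def vector_distances(query_vec, doc_vectors):
    """Cosine distance from the query to each training vector, min-max normalized
    into [0, 1] and returned as dist_vec (normalization scheme assumed)."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
    cos_sim = (doc_vectors @ query_vec) / np.where(norms == 0, 1.0, norms)
    dist = 1.0 - cos_sim
    span = dist.max() - dist.min()
    return (dist - dist.min()) / span if span > 0 else np.zeros_like(dist)
```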
S05) The semantic regression module performs semantic regression prediction; the prediction process computes the semantic distance between the user query sentence and the training samples.

In this embodiment, the prediction process of the semantic regression model is as follows: S51) first initialize the semantic regression model, i.e. obtain the semantic code sets in the Chinese synonym thesaurus (Tongyici Cilin) of all entity words in the training-sample weight set WordIDF, and then extract the thesaurus codes of all entity words in the query sentence; then, according to the entity codes and the branches on which their leaf nodes sit, calculate the semantic distance between words with the semantic regression model:

$$Score(word_i, word_j) = 1 - weight \cdot \frac{n - k + 1}{n}$$

where Score(word_i, word_j) denotes the semantic distance between word_i and word_j. The value of weight depends on the level of the code branch at which word_i and word_j diverge: weight = 0.00 when the branches differ at level 1, 0.65 at level 2, 0.80 at level 3, 0.90 at level 4, and 0.96 at level 5; n is the total number of nodes in the branch layer where the words sit, and k is the branch distance between different semantic items. When the code branches are identical, Score = 0.00 if the code ends with '=' and Score = 0.5 if it ends with '#'.
The node total n of the branch layer where the words sit and the branch distance k between non-synonymous items both influence the similarity of the dictionary senses. For example:

Aa01A01= person; personage; figure; people native to a place
Aa01A02= mankind; stranger; the whole mankind
Aa01A03= manpower; personnel; population; mouth; index finger
Aa01A04= labor; labor force; worker
Aa01A05= ordinary man; individual

'Personage' (Aa01A01=) and 'personnel' (Aa01A03=) share the branch Aa01A, and five code branches begin with Aa01A, so n = 5. The level of the branch where they diverge is level 5 (every two digits in the code represent one level), so weight = 0.96. The branch distance between 'personage' and 'personnel' is 2, so k = 2.

Then Score('personage', 'personnel') = 1 - 0.96 × (5 - 2 + 1)/5 = 0.232.
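A minimal sketch of the Score computation on thesaurus codes such as 'Aa01A01='. The level boundaries follow the standard Cilin code layout (1/1/2/1/2 characters plus an ending flag), and the closed form Score = 1 - weight*(n-k+1)/n is the reconstruction used above, which reproduces the 0.232 of the worked example:

```python
WEIGHT = {1: 0.00, 2: 0.65, 3: 0.80, 4: 0.90, 5: 0.96}
LEVEL_SLICES = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]  # Cilin levels 1..5

def score(code_i: str, code_j: str, n: int, k: int) -> float:
    """Semantic distance between two Cilin codes. n: node count in the branch layer,
    k: branch distance between the two semantic items."""
    if code_i[:7] == code_j[:7]:            # identical branch: '=' synonym, '#' related
        return 0.0 if code_i[7] == "=" else 0.5
    for level, (lo, hi) in enumerate(LEVEL_SLICES, start=1):
        if code_i[lo:hi] != code_j[lo:hi]:  # first level at which the codes diverge
            return 1 - WEIGHT[level] * (n - k + 1) / n

print(round(score("Aa01A01=", "Aa01A03=", n=5, k=2), 3))  # 0.232, personage vs. personnel
```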
S52) Traverse the training-sample entity words and the query-sentence entity words and compute the minimum semantic distance between words; finally, output the mean of the query-sentence entity words' minimum semantic distances as the semantic distance dist_sem.
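A minimal sketch of step S52, reusing score() from the previous sketch; the mapping of words to codes and of code pairs to (n, k) is assumed data plumbing, not spelled out in the patent:

```python
def semantic_distance(query_words, sample_words, codes, n_k):
    """dist_sem: for each query entity word take the minimum Score against all sample
    entity words, then average. codes: word -> list of Cilin codes;
    n_k: (code_i, code_j) -> (n, k); both layouts are assumptions."""
    minima = []
    for qw in query_words:
        best = min(score(ci, cj, *n_k[(ci, cj)])
                   for sw in sample_words
                   for ci in codes[qw]
                   for cj in codes[sw])
        minima.append(best)
    return sum(minima) / len(minima)
```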
S06) The similarity calculation module calculates the short-text similarity in the tensor space.

A tensor space is first built from the sentence vector-distance distribution and the semantic-distance distribution, where the values of both the vector distance and the semantic distance lie in [0, 1]. The vector distance between sentences characterizes their semantics from the statistical angle, while the semantic distance characterizes the difference between sentences from the semantic angle; the two distances of different dimensions complement each other. Therefore, following the idea of ensemble learning, both are placed in the same tensor space and considered jointly, and the final similarity between short texts is characterized by the Minkowski distance in that tensor space:

$$dist = \left( dist_{vec}^{\,p} + dist_{sem}^{\,p} \right)^{\frac{1}{p}} \tag{11}$$

where dist_vec denotes the vector distance, dist_sem the semantic distance, and p the distance coefficient. In practical applications the value of p is determined by cross-validation on the specific business data; the classical values are 1 and 2. With p = 1 the distance is the city-block (Manhattan) distance; with p = 2 it is the classical Euclidean distance.
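A minimal sketch of formula (11); with p = 1 this is the Manhattan distance and with p = 2 the Euclidean distance in the two-dimensional tensor space:

```python
def tensor_space_distance(dist_vec: float, dist_sem: float, p: int = 2) -> float:
    """Minkowski distance (formula 11) combining the two distances, both in [0, 1]."""
    return (dist_vec ** p + dist_sem ** p) ** (1 / p)

print(round(tensor_space_distance(0.3, 0.232, p=2), 3))  # 0.379
```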
The above describes only the basic principles and preferred embodiments of the present invention; improvements and substitutions made by those skilled in the art according to the present invention fall within the protection scope of the present invention.
Claims (10)
1. A short-text similarity calculation method in a vector-semantic tensor space, characterized by comprising the following steps: S01) applying natural language processing, statistical analysis, and vector analysis to the input training data set or test sample to obtain the vector distance between the user query sentence and the training samples; S02) performing semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, obtaining the semantic distance between the user query sentence and the training samples; S03) building a tensor space from the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, the values of the vector distance and the semantic distance lying in [0, 1], and then calculating the similarity between short texts in the established tensor space.
2. The short-text similarity calculation method in a vector-semantic tensor space according to claim 1, characterized in that: in step S02 the semantic analysis prediction is performed by a semantic regression model, as follows: S51) first initialize the semantic regression model, i.e. obtain the semantic code sets in the Chinese synonym thesaurus of all entity words in the training-sample weight set WordIDF; then extract the thesaurus codes of all entity words in the user query sentence; then, according to the entity codes and the branches on which their leaf nodes sit, calculate the semantic distance between words with the semantic regression model:

$$Score(word_i, word_j) = 1 - weight \cdot \frac{n - k + 1}{n}$$

where Score(word_i, word_j) denotes the semantic distance between word_i and word_j; the value of weight depends on the level of the code branch at which word_i and word_j diverge: weight = 0.00 when the branches differ at level 1, 0.65 at level 2, 0.80 at level 3, 0.90 at level 4, and 0.96 at level 5; n is the total number of nodes in the branch layer where the words sit; and k is the branch distance between different semantic items; S52) traverse the training-sample entity words and the query-sentence entity words, compute the minimum semantic distance between words, and finally output the mean of the query-sentence entity words' minimum semantic distances as the semantic distance dist_sem.
3. The short-text similarity calculation method in a vector-semantic tensor space according to claim 2, characterized in that: when word_i and word_j have identical code branches, Score = 0.00 if the code ends with '=' and Score = 0.5 if it ends with '#'.
4. The short-text similarity calculation method in a vector-semantic tensor space according to claim 1, characterized in that: in step S03 the similarity between short texts is characterized by the Minkowski distance:

$$dist = \left( dist_{vec}^{\,p} + dist_{sem}^{\,p} \right)^{\frac{1}{p}}$$

where dist_vec denotes the vector distance, dist_sem the semantic distance, and p the distance coefficient.
5. The short-text similarity calculation method in a vector-semantic tensor space according to claim 4, characterized in that: the value of p is determined by cross-validation; p equals 1 or 2.
6. The short-text similarity calculation method in a vector-semantic tensor space according to claim 1, characterized in that: the statistical analysis comprises a training process and a prediction process; the training process computes the weights of the entity words in the training word set, and the prediction process computes the weight of each entity word when the user inputs a query sentence; the training process is: S31) apply duplicate deletion and synonym merging, in sequence, to the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, each record of DocList containing one standard question sentence and its corresponding word list; S32) traverse the training word set and compute each word word_i's contribution to the questions, i.e. its weight, finally obtaining the entity-word weight set WordIDF; the document keyword matrix DocList and the entity-word weight set WordIDF are the statistical data set obtained by the statistical analysis; the prediction process extracts the entity weights from the WordIDF set according to the entity words in the query sentence when the user inputs a query.
7. The short-text similarity calculation method in a vector-semantic tensor space according to claim 6, characterized in that: the weight of each entity word in the training word set is calculated with an IDF algorithm:

$$IDF(word_i) = \log\frac{N - n_i + 0.5}{n_i + 0.5}$$

where N is the total number of questions in the training word set, n_i is the number of questions containing word_i, and 0.5 is a smoothing coefficient.
8. The short-text similarity calculation method in a vector-semantic tensor space according to claim 7, characterized in that: the vector analysis comprises a training process and a prediction process; the training process computes the relevance between the training-word-set entity-word weights and the documents and generates the document vectors, and the prediction process computes the vector distance between the user query sentence and the training samples; the training process is: S41) traverse DocList and WordIDF and compute the relevance score between the keywords of each DocList record and the question sentence, obtaining the relevance set WordDocCoef of each word with each document:

$$WordDocCoef(word_i, d_j) = IDF(word_i) \cdot R(word_i, d_j)$$

$$K = k_1 \cdot (1 - b + b \cdot Ratl_j), \qquad R(word_i, d_j) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2}$$

where b, k_1, k_2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query; the relative length is computed by first computing the average length Avgl of each document and then the ratio of the document's average length to the average length over all documents:

$$Avgl_i = \frac{Len(d_i)}{Num(d_i)}, \qquad Ratl_i = \frac{Avgl_i}{\frac{1}{N}\sum_{j=1}^{N} Avgl_j}$$

where Len(d_i) is the total length of document d_i, Num(d_i) is the number of question sentences in document d_i, and N is the total number of documents in the training set; S42) traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector-space dimension to M, taking the first M records of WordIDF as the entity keywords that span the vector space and initializing the word vectors with one-hot coding; then traverse DocList to obtain the entity word list of each document, query the relevance of that word list to the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector by weighting:

$$\vec{d}_j = \sum_{word_i \in d_j} WordDocCoef(word_i, d_j) \cdot \vec{w}_i$$

the prediction process is: after the user input completes structured analysis it is statistically analyzed; the intersection of the query sentence with the vector-space entity keywords is obtained; the vectorized representation of the query sentence is computed from the entity word vectors and the word-document relevance matrix; the cosine distance between the query vector and each training-sample vector is then computed and normalized; and the normalized result is output as the vector distance dist_vec.
9. The short-text similarity calculation method in a vector-semantic tensor space according to claim 1, characterized in that: the natural language processing extracts entity information from the user input, performs multi-dimensional tagging, and outputs structured text; the entity-information extraction uses basic natural-language-processing techniques to segment the text, filters out high-frequency meaningless words after segmentation, retains specific entity-information words and query words, and then applies word expansion and normalization to the word list obtained from segmentation; the multi-dimensional tagging is part-of-speech labeling of the entity set based on a user dictionary, uses syntactic analysis to obtain the grammatical relations between key pieces of information, and completes part-of-speech tagging and correction of the entity words.
10. The short-text similarity calculation method in a vector-semantic tensor space according to claim 1, characterized in that: the input training data set is a question set for a certain industry or field and contains several records, each record comprising one typical question and several similar questions; the test sample is a text question input by the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910607928.9A CN110347796A (en) | 2019-07-05 | 2019-07-05 | Short text similarity calculating method under vector semantic tensor space |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110347796A (en) | 2019-10-18 |
Family
ID=68177949
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910607928.9A Pending CN110347796A (en) | 2019-07-05 | 2019-07-05 | Short text similarity calculating method under vector semantic tensor space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347796A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
US20170011279A1 (en) * | 2015-07-07 | 2017-01-12 | Xerox Corporation | Latent embeddings for word images and their semantics |
JP2017162112A (en) * | 2016-03-08 | 2017-09-14 | 日本電信電話株式会社 | Word extraction device, method and program |
CN109858028A (en) * | 2019-01-30 | 2019-06-07 | 神思电子技术股份有限公司 | A kind of short text similarity calculating method based on probabilistic model |
Non-Patent Citations (3)
Title |
---|
Zhang Fangfang et al., "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science) * |
Piao Yong et al., "A tensor-based XML similarity calculation method", Control and Decision * |
Chen Hongchao et al., "Word similarity calculation in Tongyici Cilin based on path and depth", Journal of Chinese Information Processing * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144129A (en) * | 2019-12-26 | 2020-05-12 | 成都航天科工大数据研究院有限公司 | Semantic similarity obtaining method based on autoregression and self-coding |
CN111144129B (en) * | 2019-12-26 | 2023-06-06 | 成都航天科工大数据研究院有限公司 | Semantic similarity acquisition method based on autoregressive and autoencoding |
CN111460556A (en) * | 2020-04-01 | 2020-07-28 | 上海建工四建集团有限公司 | Method and device for determining relevance between drawings, storage medium and terminal |
CN112131341A (en) * | 2020-08-24 | 2020-12-25 | 博锐尚格科技股份有限公司 | Text similarity calculation method and device, electronic equipment and storage medium |
CN112287080A (en) * | 2020-10-23 | 2021-01-29 | 平安科技(深圳)有限公司 | Question sentence rewriting method and device, computer equipment and storage medium |
CN112287080B (en) * | 2020-10-23 | 2023-10-03 | 平安科技(深圳)有限公司 | Method and device for rewriting problem statement, computer device and storage medium |
CN114254090A (en) * | 2021-12-08 | 2022-03-29 | 马上消费金融股份有限公司 | Question-answer knowledge base expansion method and device |
CN114462413A (en) * | 2022-02-16 | 2022-05-10 | 平安科技(深圳)有限公司 | User entity matching method and device, computer equipment and readable storage medium |
CN114462413B (en) * | 2022-02-16 | 2023-06-23 | 平安科技(深圳)有限公司 | User entity matching method, device, computer equipment and readable storage medium |
CN116521875A (en) * | 2023-05-09 | 2023-08-01 | 江南大学 | Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection |
CN116521875B (en) * | 2023-05-09 | 2023-10-31 | 江南大学 | Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191018 |