CN112988970A - Text matching algorithm serving intelligent question-answering system - Google Patents


Info

Publication number
CN112988970A
CN112988970A
Authority
CN
China
Prior art keywords
word
model
question
similarity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110267040.2A
Other languages
Chinese (zh)
Inventor
励建科
许化
顾淼
陈再蝶
朱晓秋
樊伟东
邓明明
周杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Kangxu Technology Co ltd
Original Assignee
Zhejiang Kangxu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Kangxu Technology Co ltd filed Critical Zhejiang Kangxu Technology Co ltd
Priority to CN202110267040.2A priority Critical patent/CN112988970A/en
Publication of CN112988970A publication Critical patent/CN112988970A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text matching algorithm serving an intelligent question-answering system, which comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model. In the invention, the optimized jieba segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation. After Chinese word segmentation is performed on a "consultation question", word vector embedding is performed through the word2vector model, converting the Chinese segments into computable word vectors. The modified cosine similarity model is then applied to the word vectors, improving the precision of the similarity calculation and realizing text similarity calculation. Finally, the similarity values are sorted, a similarity threshold is given, and the "fixed question" in the question-answer library text data set whose similarity value is highest and exceeds the given threshold, together with its corresponding "fixed answer", is selected as the question-answer pair for the "consultation question".

Description

Text matching algorithm serving intelligent question-answering system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text matching algorithm serving an intelligent question-answering system.
Background
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. It concerns the interaction between computers and human (natural) languages, and studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics, so research in this field involves natural language, i.e. the language people use daily; it is closely related to, but significantly different from, the study of linguistics. Natural language processing does not study natural language in general; rather, it develops computer systems, and in particular the software systems within them, that can efficiently realize natural language communication, and is therefore a part of computer science.
The invention provides a text matching algorithm serving an intelligent question-answering system, which performs text similarity calculation through Chinese word segmentation and word vector embedding, and searches the question-answer library for the question-answer pair closest to the input question.
Disclosure of Invention
To solve the above problems in the background art, a text matching algorithm serving an intelligent question-answering system is proposed.
In order to achieve the purpose, the invention adopts the following technical scheme:
A text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba segmenter comprises a precise mode and a search mode;
a trained word2vector model is obtained by training the question-answer library text data set through a CBOW model;
the modified cosine similarity calculation model is as follows:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
As a further description of the above technical solution:
the given similarity threshold calculation method comprises the following steps:
S1, inputting a "fixed question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "fixed question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, and selecting the largest similarity calculation value as the given similarity threshold.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a precise mode.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a full mode.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a search mode.
As a further description of the above technical solution:
The CBOW model expresses the input Chinese segments as 256-dimensional word vectors.
As a further description of the above technical solution:
The unmodified cosine similarity model is:

$$\cos(x,y)=\frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\sqrt{\sum_{i=1}^{n} y_{i}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. In the invention, the optimized jieba segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation, solving the longest-word-first problem of the precise mode and the information-overlap problem of the search mode. After Chinese word segmentation is performed on a "consultation question", word vector embedding is performed through the word2vector model, converting the Chinese segments into computable word vectors. The modified cosine similarity model is applied to the word vectors, improving the precision of the similarity calculation and realizing text similarity calculation. Finally, the similarity values are sorted, a similarity threshold is given, and the "fixed question" in the question-answer library text data set whose similarity value is highest and exceeds the given threshold, together with its corresponding "fixed answer", is selected as the question-answer pair for the "consultation question".
2. In the invention, the Chinese word segmentation technique combines the advantages of the precise mode and the search mode of the jieba segmenter to obtain the optimized jieba segmenter, while discarding the respective disadvantages of the two modes. It solves the longest-word-first and information-overlap problems and realizes industrially efficient Chinese word segmentation; it is a very good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete.
3. In the invention, the word vector representation uses a word2vector model trained with the CBOW model. By considering the context information around each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the obtained word vectors can express the words' own semantics.
4. In the invention, a modified cosine similarity model is used, which improves the original cosine similarity model by subtracting the vector mean from each dimension, thereby taking into account the differences in magnitude across dimensions.
Drawings
FIG. 1 is a flow chart illustrating a text matching algorithm serving an intelligent question answering system according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a model of a Skip-gram provided according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a CBOW model provided according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of two vectors a, b with high cosine similarity, provided according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of two vectors a, b whose cosine similarity is equal (the vectors coincide), provided according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of two vectors a, b with low cosine similarity, provided according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1-6, the present invention provides a technical solution: a text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the text data set of the question-answer library comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
Specifically, the question-answer library text data set is shown in Table 1 below:
Table 1: question-answer library text training set
(The table itself is reproduced only as an image in the original publication.)
The optimized jieba segmenter comprises a precise mode and a search mode;
the word segmentation modes of the unoptimized jieba segmenter comprise three independent modes, namely a precise mode, a full mode and a search mode;
precise mode: attempts to cut the sentence apart most precisely; suitable for text analysis;
full mode: scans out every substring of the sentence that can form a word; very fast, but cannot resolve ambiguity;
search mode: on the basis of the precise mode, re-segments long words; improves recall and is suitable for search engine word segmentation;
Specifically, assume a scenario in which "fund manager" appears in the question-answer library text data set and "financing manager" does not. If the unoptimized jieba segmenter is used for Chinese word segmentation in this scenario, the following two segmentation results are obtained (the full-mode and search-mode results are similar):
segmentation result 1, precise mode: since the precise mode is longest-word-first, we obtain "fund manager" and "financing manager";
segmentation result 2, full mode and search mode: we obtain "fund", "manager", "fund manager" and "financing", "manager", "financing manager".
In an actual application scenario, because the question-answer library text data set contains the word "fund manager" but not "financing manager", the word "financing manager" is never learned by the word2vector model, so the word2vector model cannot recognize it;
in this case, if we select the precise mode of the jieba segmenter, then because the precise mode is longest-word-first, our word2vector model can obtain the word vector of "fund manager" but cannot obtain a word vector for "financing manager";
if we select the search mode of the jieba segmenter, we obtain the three segments "financing", "manager" and "financing manager". Although the word2vector model cannot recognize "financing manager", it can recognize the word vectors of "financing" and "manager", which substitute for the meaning expressed by "financing manager". For "fund manager", however, the search mode yields the segments "fund", "manager" and "fund manager", all of which the word2vector model can recognize; in this case the two word vectors "fund" and "manager" are redundant, causing information overlap and degrading the word vector representation of the whole sentence;
as shown above, for the two words "fund manager" and "financing manager": if we select the precise mode of the jieba segmenter, our word vector model can process "fund manager" but, because this mode defaults to longest-word-first, it cannot process "financing manager"; if we select the search mode, our word vector model can process "financing manager" but information overlap is caused when processing "fund manager". The invention therefore uses the precise mode and the search mode of the segmenter cooperatively to obtain the optimized jieba segmenter and solve the longest-word-first and information-overlap problems: for each text, we first segment with the precise mode of the jieba segmenter and perform word2vector conversion on the resulting segments; then, for each unrecognizable segment that the word2vector model has not learned, we re-segment that segment with the search mode;
specifically, using the same "fund manager" and "financing manager" example, the optimized jieba segmenter performs Chinese word segmentation and recognition as follows:
for "fund manager", since the word2vector model can recognize it, its word vector is obtained to express its Chinese meaning. For "financing manager", since the word2vector model cannot recognize it, it is re-segmented in search mode into the three segments "financing", "manager" and "financing manager"; "financing manager" still cannot be recognized and is skipped directly, while the word vectors of "financing" and "manager" are obtained to jointly express the Chinese meaning of "financing manager". Through this flexible application, the optimized jieba segmenter greatly improves the segmentation effect, achieving high efficiency without losing accuracy;
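The precise-then-search fallback described above can be sketched as follows; `precise_cut`, `search_cut` and the `known_words` vocabulary check are hypothetical stand-ins for jieba's precise mode, jieba's search mode and the word2vector vocabulary lookup, with English stand-in words for the example:

```python
def optimized_cut(text, precise_cut, search_cut, known_words):
    """Segment `text` with the precise mode first; re-segment any
    out-of-vocabulary segment with the search mode, keeping only the
    sub-segments the word-vector model knows."""
    result = []
    for seg in precise_cut(text):
        if seg in known_words:
            result.append(seg)
        else:
            # fall back to search-mode re-segmentation; skip sub-segments
            # that are still unknown (e.g. "financing manager" itself)
            result.extend(s for s in search_cut(seg) if s in known_words)
    return result

# toy stand-ins for the two jieba modes (assumed behaviour for illustration)
known = {"fund manager", "financing", "manager"}
precise = lambda t: t.split("/")          # precise mode: longest word first
search = lambda seg: seg.split() + [seg]  # search mode: sub-words plus the long word

print(optimized_cut("fund manager/financing manager", precise, search, known))
# → ['fund manager', 'financing', 'manager']
```

"fund manager" is kept whole (no information overlap), while the unknown "financing manager" degrades gracefully into its recognizable sub-words.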
A trained word2vector model is obtained by training the question-answer library text data set through the CBOW model; specifically, the CBOW model expresses the input Chinese segments as 256-dimensional word vectors;
the word2vector model network consists of two weight layers. The hidden layer consists of n neurons, where n is the word vector dimension (unified here to 256); the input and output layers each comprise M neurons, where M is the total number of Chinese words in the vocabulary, and the output layer activation function is the softmax function commonly used in classification problems. On this basis there are two training methods, Skip-gram and CBOW:
as shown in fig. 2, the Skip-gram model predicts the surrounding words from the central word: it takes the one-hot vector of the central word as the input vector and the one-hot vectors of the surrounding words as outputs to construct training sample pairs, and the output layer softmax function computes the probability that an output word is a surrounding word;
as shown in fig. 3, the CBOW model predicts the central word from the surrounding words: it takes the sum of the one-hot vectors of the surrounding words as the input vector and the one-hot vector of the central word as the output to construct training sample pairs, and the output layer softmax function yields the word with the maximum probability as the output;
after the neural network training is finished, the trained network weights can be used to represent semantics: through the conversion of the one-hot word vectors, one row of the weight matrix represents one word in the corpus vocabulary. Since no further training is performed after the word2vector model is trained, the output layer of the network can be ignored and only the input-to-hidden weights are used as the word embedding representation;
the training result of the word2vector model can be tuned through parameters such as the minimum word frequency (min_count), the maximum distance between the current word and the predicted word within a sentence (window), and the dimension of the feature vector (size);
of the two models, Skip-gram is suitable for small corpora and rare terms, while the CBOW model is better suited to common words in everyday scenarios and trains faster.
The modified cosine similarity calculation model is:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the unmodified cosine similarity model is:

$$\cos(x,y)=\frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\sqrt{\sum_{i=1}^{n} y_{i}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated;
Judging whether two texts match amounts to calculating whether the word vectors expressing the semantics of the two texts are similar. In this scenario the cosine similarity model is the most suitable and most widely used method; its principle is to use the cosine of the angle between two vectors in the vector space as the measure of the difference between the two vectors;
as shown in fig. 4, the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are;
as shown in fig. 5, when the angle between the vectors a and b is small, the two vectors have high similarity; in the extreme case they completely coincide, in which case they are considered equal, i.e. the texts represented by a and b are completely similar or identical;
as shown in fig. 6, when the angle between the vectors a and b is large, or they point in opposite directions, the two vectors have very low similarity, i.e. the texts represented by a and b are basically dissimilar;
Therefore, if the similarity of the Chinese word vectors is calculated with the unmodified cosine similarity model, the similarity in the direction of the two vectors is obtained, which works well in practice and is the common approach at present. However, the basic cosine similarity model has a limitation: it considers only the similarity in the direction of the vectors, not the differences in magnitude across dimensions. For example, for the two vectors x1 = (20, 30) and y1 = (20, 31), the computed similarity is 0.99988695, which meets our needs; however, after scaling x1 = (20, 30) to x2 = (40, 60), the vector x2 is twice as long as x1 but its direction is unchanged, so the similarity of x2 = (40, 60) and y1 = (20, 31) is still 0.99988695, which does not meet the requirements of some scenarios;
in the modified cosine similarity model, the vector magnitude is also taken into consideration: the vector mean is subtracted from each dimension, so that the magnitude differences across dimensions are considered and the effect of the cosine similarity model is improved;
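A minimal pure-Python sketch of both similarity models, reproducing the x1 = (20, 30), y1 = (20, 31) example above (the function names are ours, not the patent's):

```python
import math

def cosine(x, y):
    """Unmodified cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def modified_cosine(i, j):
    """Modified cosine similarity: subtract each vector's mean first."""
    mi, mj = sum(i) / len(i), sum(j) / len(j)
    return cosine([a - mi for a in i], [b - mj for b in j])

x1, x2, y1 = (20, 30), (40, 60), (20, 31)
print(round(cosine(x1, y1), 8))                   # → 0.99988695, as stated in the text
print(abs(cosine(x2, y1) - cosine(x1, y1)) < 1e-9)  # → True: scaling x1 leaves plain cosine unchanged
```

The first print confirms the 0.99988695 figure quoted above, and the second confirms that doubling the vector length does not change the plain cosine value.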
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
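Steps S1–S5 can be sketched end to end as follows; the embedding table, the toy question-answer pairs and the averaging of word vectors into a sentence vector are our own simplifying assumptions (the patent does not specify how word vectors are pooled):

```python
import math

def modified_cosine(i, j):
    """Modified cosine similarity: subtract each vector's mean, then cosine."""
    mi, mj = sum(i) / len(i), sum(j) / len(j)
    a = [v - mi for v in i]
    b = [v - mj for v in j]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def sentence_vector(words, embeddings):
    """Average the vectors of the known words (a simplifying assumption)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def answer(query_words, qa_pairs, embeddings, threshold):
    """S2-S5: embed the segmented query, score every fixed question with
    the modified cosine model, sort descending, apply the threshold."""
    q = sentence_vector(query_words, embeddings)
    scored = sorted(
        ((modified_cosine(q, sentence_vector(fq, embeddings)), ans)
         for fq, ans in qa_pairs),
        reverse=True,
    )
    best_score, best_answer = scored[0]
    return best_answer if best_score >= threshold else None

# toy 3-dimensional embeddings and a two-entry question-answer library (hypothetical)
emb = {"fund": [1.0, 0.2, 0.1], "manager": [0.3, 1.0, 0.2], "financing": [0.1, 0.3, 1.0]}
qa = [(["fund", "manager"], "answer about fund managers"),
      (["financing", "manager"], "answer about financing managers")]
print(answer(["fund", "manager"], qa, emb, threshold=0.9))
# → answer about fund managers
```

With a real system, the query words would come from the optimized jieba segmenter and the embeddings from the trained word2vector model.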
Specifically, the calculation method of the given similarity threshold includes the following steps:
S1, inputting a "fixed question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "fixed question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, and selecting the largest similarity calculation value as the given similarity threshold;
the Chinese word segmentation technique used in the invention combines the advantages of the precise mode and the search mode of the jieba segmenter to obtain the optimized jieba segmenter, while discarding the respective disadvantages of the two modes; it solves the longest-word-first and information-overlap problems, realizes industrially efficient Chinese word segmentation, and is a good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete;
the word vector representation uses a word2vector model trained with the CBOW model; by considering the context information around each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the obtained word vectors can express the words' own semantics;
the invention uses the modified cosine similarity model, which improves the original cosine similarity model by subtracting the vector mean from each dimension, thereby taking into account the differences in magnitude across dimensions.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change of the technical solutions and inventive concepts of the present invention made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A text matching algorithm serving an intelligent question-answering system, characterized by comprising a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba segmenter comprises a precise mode and a search mode;
a trained word2vector model is obtained by training the question-answer library text data set through a CBOW model;
the modified cosine similarity calculation model is as follows:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
2. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the given similarity threshold is calculated by the following steps:
S1, inputting a fixed question into the intelligent question-answering system;
S2, performing Chinese word segmentation on the fixed question with the optimized jieba word segmenter;
S3, embedding the Chinese segments from step S2 as word vectors through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, computing similarities over the word vectors from step S3 with the modified cosine similarity model and outputting a plurality of similarity values;
S5, sorting the similarity values in descending order and selecting the largest similarity value as the given similarity threshold.
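Claim 2's threshold selection can be condensed to a one-function sketch. The `jaccard` word-overlap scorer here is a toy stand-in for the modified cosine model run over the real segmented and embedded questions.

```python
def jaccard(a, b):
    """Toy word-overlap scorer standing in for the modified cosine model."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def given_threshold(fixed_question, fixed_set, similarity=jaccard):
    """Claim 2, S4-S5: score the fixed question against every other
    fixed question, sort the scores in descending order, and keep the
    largest value as that question's given similarity threshold."""
    scores = sorted((similarity(fixed_question, other)
                     for other in fixed_set if other != fixed_question),
                    reverse=True)
    return scores[0]
```

The effect is that each fixed question's threshold reflects how close its nearest neighbour in the fixed question set already is, so an incoming consultation question must beat the best in-set match.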
3. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a precise mode.
4. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a full mode.
5. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a search engine mode.
6. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the CBOW model represents each input Chinese segment as a 256-dimensional word vector.
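Claim 6's projection step can be illustrated in miniature. CBOW's hidden layer is simply the average of the context words' input vectors, itself a 256-dimensional vector; the vocabulary and random weights below are placeholders for what the trained model would have learned.

```python
import random

DIM = 256   # claim 6: each Chinese segment maps to a 256-dimensional vector

random.seed(0)
# Hypothetical vocabulary; a trained CBOW model would learn these weights.
VOCAB = ["今天", "天气", "怎么样"]
EMBEDDING = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)]
             for w in VOCAB}

def cbow_hidden(context):
    """CBOW projection: average the context words' input vectors into a
    single hidden vector of the same (256) dimension."""
    vecs = [EMBEDDING[w] for w in context]
    return [sum(c) / len(vecs) for c in zip(*vecs)]
```

In a real system this lookup-and-average would be performed by a trained model (e.g. a word2vec implementation configured for CBOW with 256-dimensional vectors), not by a random table.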
7. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the unmodified cosine similarity model is:

$$\cos(\theta)=\frac{\sum_{k=1}^{n}x_{k}y_{k}}{\sqrt{\sum_{k=1}^{n}x_{k}^{2}}\;\sqrt{\sum_{k=1}^{n}y_{k}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated.
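For contrast with the modified model of claim 1, the unmodified cosine of claim 7 omits the mean-centering step; a direct transcription:

```python
from math import sqrt

def cosine(x, y):
    """Unmodified cosine similarity between two n-dimensional vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0 regardless of magnitude, e.g. `cosine([1, 1], [2, 2])` is 1.0.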
CN202110267040.2A 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system Pending CN112988970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267040.2A CN112988970A (en) 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system

Publications (1)

Publication Number Publication Date
CN112988970A true CN112988970A (en) 2021-06-18

Family

ID=76334595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110267040.2A Pending CN112988970A (en) 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system

Country Status (1)

Country Link
CN (1) CN112988970A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
US20170083507A1 (en) * 2015-09-22 2017-03-23 International Business Machines Corporation Analyzing Concepts Over Time
CN110362684A (en) * 2019-06-27 2019-10-22 腾讯科技(深圳)有限公司 A kind of file classification method, device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Heyong: "High-Dimensional Data Mining Technology for Big Data", 31 March 2018 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186679A (en) * 2022-07-15 2022-10-14 广东广信通信服务有限公司 Intelligent response method, device, computer equipment and storage medium
CN114996439A (en) * 2022-08-01 2022-09-02 太极计算机股份有限公司 Text search method and device
CN116820711A (en) * 2023-06-07 2023-09-29 上海幽孚网络科技有限公司 Task driven autonomous agent algorithm
CN116820711B (en) * 2023-06-07 2024-05-28 上海幽孚网络科技有限公司 Task driven autonomous agent method

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Gao et al. Convolutional neural network based sentiment analysis using Adaboost combination
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN112905795A (en) Text intention classification method, device and readable medium
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114818703B (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN108536781B (en) Social network emotion focus mining method and system
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
CN114357151A (en) Processing method, device and equipment of text category identification model and storage medium
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN114417851A (en) Emotion analysis method based on keyword weighted information
Ko et al. Paraphrase bidirectional transformer with multi-task learning
Chan et al. Applying and optimizing NLP model with CARU
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618