CN112988970A - Text matching algorithm serving intelligent question-answering system - Google Patents
- Publication number
- CN112988970A (application CN202110267040.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- model
- question
- similarity
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3329: Natural language query formulation or dialogue systems
- G06F16/334: Query execution
- G06F16/335: Filtering based on additional data, e.g. user or group profiles
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text matching algorithm serving an intelligent question-answering system, comprising a question-answer library text data set, an optimized jieba word segmenter, a trained word2vector model and a modified cosine similarity model. In the invention, the optimized jieba word segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation. After Chinese word segmentation of a 'consultation question', word vector embedding through the word2vector model converts the segments into computable word vectors; the modified cosine similarity model then computes text similarity over these word vectors with improved precision. Finally, the similarity values are sorted and compared against a given similarity threshold, and the 'fixed question' in the question-answer library text data set whose similarity value is the highest and exceeds the threshold, together with its corresponding 'fixed answer', is selected as the question-answer pair of the 'consultation question'.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text matching algorithm serving an intelligent question-answering system.
Background
NLP (Natural Language Processing) is an important direction in computer science and artificial intelligence. It sits at the intersection of computer science, artificial intelligence and linguistics, focusing on the interaction between computers and human (natural) languages, and studies theories and methods for effective communication between people and computers in natural language. Natural language processing thus integrates linguistics, computer science and mathematics, so research in this field involves natural language, i.e. the language people use daily. It is closely related to, but significantly different from, the study of linguistics: natural language processing does not study natural language in general, but develops computer systems, and in particular software systems, that can process natural language effectively, and is therefore part of computer science.
The invention provides a text matching algorithm serving an intelligent question-answering system, which is used for performing text similarity calculation through Chinese word segmentation and word vector embedding and trying to search a question-answer pair closest to a question in a question-answering library.
Disclosure of Invention
To solve the problems raised in the background art above, a text matching algorithm serving an intelligent question-answering system is proposed.
In order to achieve the purpose, the invention adopts the following technical scheme:
A text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba word segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba word segmenter combines a precise mode and a search mode;
the trained word2vector model is obtained by training the question-answer library text data set with a CBOW model;
the modified cosine similarity calculation model is as follows:

sim(i, j) = Σ_u (R_{u,i} - R̄_i)(R_{u,j} - R̄_j) / ( √(Σ_u (R_{u,i} - R̄_i)²) · √(Σ_u (R_{u,j} - R̄_j)²) )

wherein i and j are vectors of the same dimension, R̄_i and R̄_j denote the means of vectors i and j, R_{u,i} denotes the u-th component of the input vector i, and R_{u,j} denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, a 'consultation question' is input into the intelligent question-answering system;
S2, Chinese word segmentation is performed on the 'consultation question' by the optimized jieba word segmenter;
S3, word vector embedding is performed on the Chinese segments of step S2 by the trained word2vector model, converting them into word vectors of unified dimension;
S4, similarity calculation is performed on the word vectors of step S3 by the modified cosine similarity model, and a plurality of similarity calculation values are output;
S5, the similarity calculation values are sorted in descending order and compared with the given similarity threshold of each 'fixed question' in the fixed question set; the 'fixed question' whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding 'fixed answer', is selected as the consultation answer to the 'consultation question'.
As a further description of the above technical solution:
the given similarity threshold calculation method comprises the following steps:
S1, a 'fixed question' is input into the intelligent question-answering system;
S2, Chinese word segmentation is performed on the 'fixed question' by the optimized jieba word segmenter;
S3, word vector embedding is performed on the Chinese segments of step S2 by the trained word2vector model, converting them into word vectors of fixed dimension;
S4, similarity calculation is performed on the word vectors of step S3 by the modified cosine similarity model, and a plurality of similarity calculation values are output;
S5, the similarity calculation values are sorted in descending order, and the largest value is selected as the given similarity threshold.
As a further description of the above technical solution:
The segmentation modes of the unoptimized jieba word segmenter include a precise mode.
As a further description of the above technical solution:
The segmentation modes of the unoptimized jieba word segmenter include a full mode.
As a further description of the above technical solution:
The segmentation modes of the unoptimized jieba word segmenter include a search mode.
As a further description of the above technical solution:
The CBOW model expresses the input Chinese segments as 256-dimensional word vectors.
As a further description of the above technical solution:
the unmodified cosine similarity model is:

cos(x, y) = Σ_{i=1}^{n} x_i · y_i / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

wherein x and y are the two n-dimensional vectors whose similarity is calculated.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. The optimized jieba word segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation, resolving the conflict between the longest-word-first rule of the precise mode and the information overlap of the search mode. After Chinese word segmentation of the 'consultation question', word vector embedding through the word2vector model converts the segments into computable word vectors, and the modified cosine similarity model improves the precision of the similarity calculation, realizing text similarity calculation. Finally the similarity values are sorted and compared against a given similarity threshold, and the 'fixed question' in the question-answer library text data set whose similarity value is the highest and exceeds the threshold, together with its corresponding 'fixed answer', is selected as the question-answer pair of the 'consultation question'.
2. The Chinese word segmentation technique used in the invention combines the advantages of the precise mode and the search mode of the jieba word segmenter while discarding their respective disadvantages, solving the longest-word-first and information-overlap problems; it achieves industrially efficient Chinese word segmentation and is a very good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete.
3. For word vector expression, a word2vector model trained with the CBOW model is selected; by considering the context near each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the resulting word vectors can express the semantics of the words themselves.
4. The invention uses a modified cosine similarity model, which improves the original cosine similarity model by subtracting the mean in each dimension, taking the difference in scale across dimensions into account.
Drawings
FIG. 1 is a flow chart illustrating a text matching algorithm serving an intelligent question answering system according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a model of a Skip-gram provided according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a CBOW model provided according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of two vectors a, b that are cosine-similar, provided according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of two vectors a, b that are cosine-equal, provided according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of two vectors a, b that are cosine-dissimilar, provided according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1-6, the present invention provides a technical solution: a text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba word segmenter, a trained word2vector model and a modified cosine similarity model;
the text data set of the question-answer library comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
Specifically, the question-answer library text data set is shown in Table 1 below:
Table 1: question-answer library text training set
The optimized jieba word segmenter combines a precise mode and a search mode;
the segmentation modes of the unoptimized jieba word segmenter comprise three independent modes: a precise mode, a full mode and a search mode;
the precise mode attempts to cut the sentence apart most accurately and is suitable for text analysis;
the full mode scans out all words in the sentence that can form words; it is very fast, but cannot resolve ambiguity;
the search mode segments long words again on the basis of the precise mode; it improves recall and is suitable for search-engine word segmentation;
Specifically, assume a scenario in which 'fund manager' appears in the question-answer library text data set but 'financing manager' does not. If the unoptimized jieba word segmenter is used for Chinese word segmentation in this scenario, the following two segmentation results are obtained, the results of the full mode and the search mode being similar:
segmentation result one (precise mode): since the precise mode is longest-word-first, the results 'fund manager' and 'financing manager' are obtained;
segmentation result two (full mode and search mode): the results 'fund', 'manager', 'fund manager' and 'financing', 'manager', 'financing manager' are obtained.
In an actual application scenario, because the question-answer library text data set contains the vocabulary 'fund manager' but not 'financing manager', the word 'financing manager' is never learned by the word2vector model and therefore cannot be recognized by it;
in this case, if the precise-mode jieba word segmenter is selected, then since the precise mode is longest-word-first, the word2vector model can obtain the word vector of 'fund manager' but cannot obtain a word vector for 'financing manager';
if the search-mode jieba word segmenter is selected, the three segments 'financing', 'manager' and 'financing manager' are obtained; although the word2vector model cannot recognize 'financing manager', the word vectors of 'financing' and 'manager' can be recognized and substitute for the meaning expressed by 'financing manager'. For 'fund manager', however, the search mode yields the segments 'fund', 'manager' and 'fund manager', all recognizable by the word2vector model; in this case the two word vectors 'fund' and 'manager' are redundant, causing information overlap that distorts the word vector expression of the whole sentence;
As shown above, for the two vocabularies 'fund manager' and 'financing manager': if the precise mode of the jieba segmenter is selected, the word vector model can process 'fund manager', but because this mode defaults to longest-word-first it cannot process 'financing manager'; if the search mode is selected, the model can process 'financing manager', but processing 'fund manager' causes information overlap. The invention therefore uses the precise mode and the search mode of the segmenter cooperatively to obtain the optimized jieba segmenter and resolve the conflict between longest-word-first and information overlap: each text is first segmented in the precise mode and the resulting vocabulary is converted with the word2vector model; any unrecognizable vocabulary not learned by the word2vector model is then segmented again in the search mode;
Specifically, using the same 'fund manager' and 'financing manager' example, Chinese word segmentation with the optimized jieba word segmenter proceeds as follows:
for 'fund manager', which the word2vector model can recognize, its word vector is obtained directly to express the Chinese meaning of 'fund manager'; for 'financing manager', which the word2vector model cannot recognize, the segment is re-segmented in search mode into the three words 'financing', 'manager' and 'financing manager', of which the unrecognizable 'financing manager' is simply skipped, while the word vectors of 'financing' and 'manager' jointly express the Chinese meaning of 'financing manager'. The flexible application of the optimized jieba segmenter thus greatly improves the segmentation result, with high efficiency and no loss of accuracy;
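The hybrid precise-then-search strategy described above can be sketched as follows. The tokenizer callables stand in for jieba's precise mode (`jieba.lcut`) and search mode (`jieba.lcut_for_search`); the stub tokenizers and the vocabulary set below are hypothetical, chosen only to reproduce the patent's 'fund manager' / 'financing manager' example without a jieba dependency:

```python
def hybrid_segment(text, precise_cut, search_cut, known_vocab):
    """Precise mode first; out-of-vocabulary segments are re-cut in search mode
    and only the sub-words the word-vector model knows are kept."""
    result = []
    for token in precise_cut(text):
        if token in known_vocab:
            result.append(token)      # recognizable: keep the longest word
        else:
            result.extend(w for w in search_cut(token) if w in known_vocab)
    return result

# Stub tokenizers reproducing the patent's example (hypothetical data; in a
# real deployment these would be jieba.lcut and jieba.lcut_for_search).
PRECISE = {"fund manager": ["fund manager"],
           "financing manager": ["financing manager"]}
SEARCH = {"fund manager": ["fund", "manager", "fund manager"],
          "financing manager": ["financing", "manager", "financing manager"]}
precise_cut = lambda t: PRECISE.get(t, [t])
search_cut = lambda t: SEARCH.get(t, [t])

VOCAB = {"fund manager", "fund", "manager", "financing"}  # words the model learned

print(hybrid_segment("fund manager", precise_cut, search_cut, VOCAB))
# ['fund manager']  (no redundant 'fund' + 'manager', so no information overlap)
print(hybrid_segment("financing manager", precise_cut, search_cut, VOCAB))
# ['financing', 'manager']  (OOV word replaced by recognizable sub-words)
```

In a real deployment `known_vocab` would be the vocabulary of the trained word2vector model, so the fallback triggers exactly for segments the model cannot embed.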
The trained word2vector model is obtained by training the question-answer library text data set with the CBOW model; specifically, the CBOW model expresses each input Chinese segment as a 256-dimensional word vector;
the word2vector network consists of two weight layers: the hidden layer has n neurons, where n is the word-vector dimension, unified here to 256; the input layer and the output layer each have M neurons, where M is the total number of Chinese segments in the vocabulary; the output-layer activation function is the softmax function commonly used in classification problems. On this basis there are two training methods, Skip-gram and CBOW:
as shown in fig. 2, the Skip-gram model predicts the surrounding words from the central word: the one-hot vector of the central word serves as the input vector and the one-hot vectors of the surrounding words as the outputs of the training sample pairs, and the output-layer softmax function computes the probability that an output word is a surrounding word;
as shown in fig. 3, the CBOW model predicts the central word from the surrounding words: the sum of the one-hot vectors of the surrounding words serves as the input vector and the one-hot vector of the central word as the output of the training sample pair, and the output-layer softmax function yields the word with the maximum probability as the output;
after the neural network is trained, the trained network weights can represent semantics: through the conversion of each entry's one-hot vector, one row of the weight matrix represents one word of the corpus vocabulary; since no additional training takes place after the word2vector model is trained, the output layer of the network can be ignored and only the hidden-layer input weights are used as the word embedding representation;
the training result of the word2vector model can be tuned through parameters such as the minimum word frequency (min_count), the maximum distance between the current word and the predicted word within a sentence (window), and the dimension of the feature vector (size);
of the two models, Skip-gram is suitable for small corpora and rare terms, while the CBOW model is more suitable for common words in ordinary scenarios and trains faster.
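The two-layer network described above can be sketched with a minimal numpy forward pass. The sizes (M = 10 words, n = 4 dimensions) and random weights are illustrative only; the patent uses n = 256 and trains the weights on the question-answer library:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 4     # vocabulary size and word-vector dimension (patent: n = 256)
W_in = rng.normal(size=(M, n))    # input -> hidden weights; row i embeds word i
W_out = rng.normal(size=(n, M))   # hidden -> output weights

def one_hot(idx):
    """One-hot vector for the word with index idx."""
    v = np.zeros(M)
    v[idx] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()

def cbow_forward(context_ids):
    """CBOW forward pass: sum of context one-hots in, centre-word probabilities out."""
    x = sum(one_hot(i) for i in context_ids)  # sum of surrounding-word one-hots
    h = W_in.T @ x                            # hidden layer, no activation
    return softmax(W_out.T @ h)               # probability over all M words

probs = cbow_forward([1, 3, 5, 7])
print(probs.shape, float(probs.sum()))        # (10,) and ~1.0
```

After training, the rows of `W_in` would serve as the word embeddings, matching the remark above that only the hidden-layer input weights are kept.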
The modified cosine similarity calculation model is:

sim(i, j) = Σ_u (R_{u,i} - R̄_i)(R_{u,j} - R̄_j) / ( √(Σ_u (R_{u,i} - R̄_i)²) · √(Σ_u (R_{u,j} - R̄_j)²) )

wherein i and j are vectors of the same dimension, R̄_i and R̄_j denote the means of vectors i and j, R_{u,i} denotes the u-th component of the input vector i, and R_{u,j} denotes the u-th component of the input vector j;
the unmodified cosine similarity model is:

cos(x, y) = Σ_{i=1}^{n} x_i · y_i / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

wherein x and y are the two n-dimensional vectors whose similarity is calculated;
Judging whether two texts match amounts to calculating whether the word vectors expressing the semantics of the two texts are close. In this scenario the cosine similarity model is the most suitable and most widely used method; its principle is to use the cosine of the included angle between two vectors in a vector space as the measure of the difference between them;
as shown in fig. 4, the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are;
as shown in fig. 5, when the included angle between the two vectors a and b is small, the similarity of a and b is high; in the extreme case the two vectors coincide completely, in which case a and b are considered equal, i.e. the texts they represent are completely similar or identical;
as shown in fig. 6, when the included angle between the two vectors a and b is large, or they point in opposite directions, the similarity of a and b is very low, i.e. the texts they represent are essentially dissimilar;
Therefore, if the similarity of the Chinese segments is calculated with the unmodified cosine similarity model, the directional similarity between two vectors is obtained, which already gives good results and is the common practice at present. This basic cosine similarity model has a limitation, however: it considers only the similarity in vector direction and ignores differences in magnitude. For example, for the two vectors x₁ = (20, 30) and y₁ = (20, 31), the calculated similarity is 0.99988695, which satisfies our needs; but after scaling x₁ = (20, 30) to x₂ = (40, 60), vector x₂ is twice as long as x₁ while its direction is unchanged, so the similarity of x₂ = (40, 60) and y₁ = (20, 31) is still 0.99988695, which does not meet the requirements of some scenarios;
in the modified cosine similarity model this is taken into account as well: the mean is subtracted in each dimension, so differences in scale across dimensions are considered and the effect of the cosine similarity model is improved;
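Both similarity models, and the x₁/x₂ example above, can be checked with a short sketch. `modified_cosine` subtracts each vector's own mean, following the symbol definitions R̄_i and R̄_j in the formula; the three-dimensional vectors in the last line are hypothetical, added only to show that the two models can disagree:

```python
import math

def cosine(x, y):
    """Unmodified cosine similarity of two same-dimension vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def modified_cosine(x, y):
    """Cosine similarity after subtracting each vector's own mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

x1, y1, x2 = (20, 30), (20, 31), (40, 60)   # x2 is x1 scaled by 2
print(round(cosine(x1, y1), 8))             # 0.99988695, the value cited above
print(abs(cosine(x2, y1) - cosine(x1, y1)) < 1e-12)  # True: scaling is invisible
print(modified_cosine((2, 0, 1), (1, 2, 3)))         # differs from plain cosine
```

The first two prints reproduce the limitation described in the text: the plain model cannot tell x₁ from its scaled copy x₂.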
the text matching algorithm comprises the following steps:
S1, a 'consultation question' is input into the intelligent question-answering system;
S2, Chinese word segmentation is performed on the 'consultation question' by the optimized jieba word segmenter;
S3, word vector embedding is performed on the Chinese segments of step S2 by the trained word2vector model, converting them into word vectors of unified dimension;
S4, similarity calculation is performed on the word vectors of step S3 by the modified cosine similarity model, and a plurality of similarity calculation values are output;
S5, the similarity calculation values are sorted in descending order and compared with the given similarity threshold of each 'fixed question' in the fixed question set; the 'fixed question' whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding 'fixed answer', is selected as the consultation answer to the 'consultation question'.
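Steps S1 to S5 above can be sketched end to end. All data below is hypothetical (a toy word-vector table and a two-entry question-answer library), the queries are assumed to be already segmented, and averaging word vectors into a text vector is an assumption of this sketch, since the patent does not fix how word vectors are aggregated:

```python
import math

# Toy word vectors and question-answer library; all data here is hypothetical,
# standing in for a trained word2vector model and a real question-answer library.
WORD_VECS = {
    "fund":    [1.0, 0.2, 0.0],
    "manager": [0.1, 1.0, 0.3],
    "salary":  [0.0, 0.3, 1.0],
}
QA_LIBRARY = {                      # fixed question (pre-segmented) -> fixed answer
    ("fund", "manager"): "A fund manager runs the fund's portfolio.",
    ("salary",):         "Salaries are paid monthly.",
}

def embed(tokens):
    """Average the vectors of the recognizable tokens into one text vector."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def modified_cosine(x, y):
    """Cosine similarity after subtracting each vector's own mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc, yc = [a - mx for a in x], [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc)))

def answer(query_tokens, threshold=0.5):
    """Steps S3-S5: embed, score against every fixed question, sort descending,
    return the best fixed answer if it clears the similarity threshold."""
    qv = embed(query_tokens)
    scored = sorted(
        ((modified_cosine(qv, embed(fq)), fq) for fq in QA_LIBRARY),
        reverse=True,
    )
    best_score, best_fq = scored[0]
    return QA_LIBRARY[best_fq] if best_score >= threshold else None

print(answer(("fund", "manager")))
```

Here `answer` returns `None` when no fixed question clears the threshold, corresponding to the case where the library contains no suitable question-answer pair.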
Specifically, the calculation method of the given similarity threshold includes the following steps:
S1, a 'fixed question' is input into the intelligent question-answering system;
S2, Chinese word segmentation is performed on the 'fixed question' by the optimized jieba word segmenter;
S3, word vector embedding is performed on the Chinese segments of step S2 by the trained word2vector model, converting them into word vectors of fixed dimension;
S4, similarity calculation is performed on the word vectors of step S3 by the modified cosine similarity model, and a plurality of similarity calculation values are output;
S5, the similarity calculation values are sorted in descending order, and the largest value is selected as the given similarity threshold;
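The threshold procedure above leaves the comparison set implicit; one plausible reading (an interpretation, not stated verbatim in the patent) is that each fixed question's threshold is the largest similarity it attains against the other fixed questions in the library, so a consultation question must match it at least as well as any sibling question does. A sketch under that assumption, with hypothetical three-dimensional text vectors:

```python
import math

def modified_cosine(x, y):
    """Cosine similarity after subtracting each vector's own mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc)))

def similarity_threshold(fixed_vec, other_vecs):
    """Steps S4-S5: score the fixed question against the other fixed questions,
    sort descending, and take the largest value as its threshold."""
    return max(modified_cosine(fixed_vec, v) for v in other_vecs)

q = [0.9, 0.4, 0.1]                              # vector of one fixed question
others = [[0.8, 0.5, 0.2], [0.1, 0.2, 0.9]]      # the remaining fixed questions
print(round(similarity_threshold(q, others), 4))
```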
The Chinese word segmentation technique used in the invention combines the advantages of the precise mode and the search mode of the jieba word segmenter to obtain the optimized jieba word segmenter while discarding their respective disadvantages, solving the longest-word-first and information-overlap problems; it achieves industrially efficient Chinese word segmentation and is a good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete;
for word vector expression, a word2vector model trained with the CBOW model is selected; by considering the context near each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the resulting word vectors can express the semantics of the words themselves;
the invention uses the modified cosine similarity model, which improves the original cosine similarity model by subtracting the mean in each dimension, taking the difference in scale across dimensions into account.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change that a person skilled in the art can readily conceive within the technical scope disclosed herein, according to the technical solutions and the inventive concept of the present invention, shall fall within the protection scope of the present invention.
Claims (7)
1. A text matching algorithm serving an intelligent question-answering system, characterized by comprising a question-answer library text data set, an optimized jieba word segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba word segmenter combines a precise mode and a search mode;
the trained word2vector model is obtained by training the question-answer library text data set with a CBOW model;
the modified cosine similarity calculation model is as follows:

sim(i, j) = Σ_u (R_{u,i} - R̄_i)(R_{u,j} - R̄_j) / ( √(Σ_u (R_{u,i} - R̄_i)²) · √(Σ_u (R_{u,j} - R̄_j)²) )

wherein i and j are vectors of the same dimension, R̄_i and R̄_j denote the means of vectors i and j, R_{u,i} denotes the u-th component of the input vector i, and R_{u,j} denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, a 'consultation question' is input into the intelligent question-answering system;
S2, Chinese word segmentation is performed on the 'consultation question' by the optimized jieba word segmenter;
S3, word vector embedding is performed on the Chinese segments of step S2 by the trained word2vector model, converting them into word vectors of unified dimension;
S4, similarity calculation is performed on the word vectors of step S3 by the modified cosine similarity model, and a plurality of similarity calculation values are output;
S5, the similarity calculation values are sorted in descending order and compared with the given similarity threshold of each 'fixed question' in the fixed question set; the 'fixed question' whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding 'fixed answer', is selected as the consultation answer to the 'consultation question'.
2. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the given similarity threshold is calculated by the following steps:
s1, inputting a 'fixed question' in the intelligent question answering system;
s2, performing Chinese word segmentation on the fixed problem through the optimized jieba word segmentation device;
s3, performing word vector embedding on the Chinese participles in the step S2 through a trained word2vector model, and converting the Chinese participles into word vectors with fixed dimensions;
s4, carrying out similarity calculation on the word vectors in the step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
s5, the similarity calculation values are sorted in descending order, and the similarity calculation value with the largest similarity calculation value is selected as the given similarity threshold.
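One reading of claim 2, sketched with toy data: the threshold for a fixed question is the largest similarity it attains against the other fixed questions. The helper function and the three sample vectors are illustrative assumptions, not the patent's implementation:

```python
import math

def modified_cosine(i, j):
    """Mean-centered cosine similarity (the claim's modified model)."""
    mi, mj = sum(i) / len(i), sum(j) / len(j)
    num = sum((a - mi) * (b - mj) for a, b in zip(i, j))
    den = (math.sqrt(sum((a - mi) ** 2 for a in i))
           * math.sqrt(sum((b - mj) ** 2 for b in j)))
    return num / den if den else 0.0

# Toy embedded fixed questions (stand-ins for segmented, embedded text).
fixed_vectors = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.7, 0.3, 0.1],
    "q3": [0.0, 0.2, 0.9],
}

def given_threshold(name):
    """S5 of claim 2: largest similarity against the other fixed questions."""
    target = fixed_vectors[name]
    scores = sorted((modified_cosine(target, v)
                     for k, v in fixed_vectors.items() if k != name),
                    reverse=True)
    return scores[0]

threshold_q1 = given_threshold("q1")
```

A per-question threshold computed this way is the score a query must beat before its match is trusted over the nearest competing fixed question.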
3. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the non-optimized jieba word segmenter comprise a precise mode.
4. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the non-optimized jieba word segmenter comprise a full mode.
5. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the non-optimized jieba word segmenter comprise a search mode.
6. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the CBOW model expresses input Chinese word segments as 256-dimensional word vectors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110267040.2A CN112988970A (en) | 2021-03-11 | 2021-03-11 | Text matching algorithm serving intelligent question-answering system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112988970A true CN112988970A (en) | 2021-06-18 |
Family
ID=76334595
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112988970A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170083507A1 (en) * | 2015-09-22 | 2017-03-23 | International Business Machines Corporation | Analyzing Concepts Over Time |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
CN110362684A (en) * | 2019-06-27 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A kind of file classification method, device and computer equipment |
Non-Patent Citations (1)
Title |
---|
王和勇 (Wang Heyong): 《面向大数据的高维数据挖掘技术》 (High-Dimensional Data Mining Technology for Big Data), 31 March 2018 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115186679A (en) * | 2022-07-15 | 2022-10-14 | 广东广信通信服务有限公司 | Intelligent response method, device, computer equipment and storage medium |
CN114996439A (en) * | 2022-08-01 | 2022-09-02 | 太极计算机股份有限公司 | Text search method and device |
CN116820711A (en) * | 2023-06-07 | 2023-09-29 | 上海幽孚网络科技有限公司 | Task driven autonomous agent algorithm |
CN116820711B (en) * | 2023-06-07 | 2024-05-28 | 上海幽孚网络科技有限公司 | Task driven autonomous agent method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
Gao et al. | Convolutional neural network based sentiment analysis using Adaboost combination | |
CN112732916B (en) | BERT-based multi-feature fusion fuzzy text classification system | |
CN108170848B (en) | Chinese mobile intelligent customer service-oriented conversation scene classification method | |
CN112988970A (en) | Text matching algorithm serving intelligent question-answering system | |
CN113204952B (en) | Multi-intention and semantic slot joint identification method based on cluster pre-analysis | |
CN113626589B (en) | Multi-label text classification method based on mixed attention mechanism | |
CN112905795A (en) | Text intention classification method, device and readable medium | |
CN114743020A (en) | Food identification method combining tag semantic embedding and attention fusion | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN114330354B (en) | Event extraction method and device based on vocabulary enhancement and storage medium | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN114818703B (en) | Multi-intention recognition method and system based on BERT language model and TextCNN model | |
CN111914556A (en) | Emotion guiding method and system based on emotion semantic transfer map | |
CN112232053A (en) | Text similarity calculation system, method and storage medium based on multi-keyword pair matching | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
CN116226785A (en) | Target object recognition method, multi-mode recognition model training method and device | |
CN108536781B (en) | Social network emotion focus mining method and system | |
CN111984780A (en) | Multi-intention recognition model training method, multi-intention recognition method and related device | |
CN114357151A (en) | Processing method, device and equipment of text category identification model and storage medium | |
CN113705238A (en) | Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
Ko et al. | Paraphrase bidirectional transformer with multi-task learning | |
Chan et al. | Applying and optimizing NLP model with CARU | |
CN113065350A (en) | Biomedical text word sense disambiguation method based on attention neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210618 |