CN115712704A - Sentence vector generating method and device, matching method and device and storage medium - Google Patents
- Publication number
- CN115712704A (application number CN202110955229.0A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- vector
- question
- filtered
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a sentence vector generating method and device, a matching method and device, and a storage medium. The sentence vector generating method comprises the following steps: preprocessing a received original sentence to obtain a sentence with stop words filtered out and entity labels applied; processing the sentence with stop words filtered to obtain a first sentence vector, wherein the first sentence vector comprises attribute information of the characters included in that sentence; processing the sentence with stop words filtered and entity labels applied, together with the original sentence, to obtain a second sentence vector, wherein the second sentence vector comprises information of the entity labels; the target vector of the original sentence comprises the first sentence vector and the second sentence vector. The target vector obtained by this method contains more effective information, which improves the accuracy of the results output by an intelligent question-answering system.
Description
Technical Field
The application belongs to the field of natural language processing, and particularly relates to a sentence vector generation method and device, a matching method and device, and a storage medium.
Background
An intelligent question-answering system generally generates a sentence vector for the question input by the user, performs similarity matching between that vector and the sentence vectors of questions stored in advance in the system, obtains a target question and the answer corresponding to the target question, and outputs the answer.
In the prior art, the sentence vector of a question is generally generated by a text feature extraction method (for example, word2vec, tf-idf, or word2vec + tf-idf). However, these methods are weak at capturing grammar and word order, so the generated sentence vectors contain little effective information and cannot adequately express the information in a sentence, which lowers the accuracy of the answers output by the intelligent question-answering system.
Disclosure of Invention
In view of the above, an object of the present application is to provide a sentence vector generating method and apparatus, a matching method and apparatus, and a storage medium, where the sentence vector generated by the vector generating method can cover more effective information, and is beneficial to improving the accuracy of the result output by the intelligent question-answering system.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a sentence vector generating method, where the method includes: preprocessing a received original sentence to obtain a sentence with stop words filtered out and entity labels applied; processing the sentence with stop words filtered to obtain a first sentence vector, wherein the first sentence vector comprises attribute information of the characters included in that sentence, and the attribute information comprises at least one of glyph, pronunciation, and stroke; processing the sentence with stop words filtered and entity labels applied, together with the original sentence, to obtain a second sentence vector, wherein the second sentence vector comprises information of the entity labels; the target vector of the original sentence comprises the first sentence vector and the second sentence vector.
The target vector obtained by processing the original sentence according to this embodiment comprises both the entity label information in the original sentence and the attribute information of the Chinese characters remaining after stop words are removed.
With reference to the embodiment of the first aspect, in a possible implementation manner, when the attribute information includes glyph, pronunciation, and stroke, processing the sentence with stop words filtered to obtain a first sentence vector includes: inputting the sentence with stop words filtered into a Tianzige (character-grid) CNN model and outputting a first sub-vector; inputting the sentence with stop words filtered into a stroke-pinyin model and outputting a second sub-vector; and performing weighted fusion on the first sub-vector and the second sub-vector to obtain the first sentence vector.
With reference to the embodiment of the first aspect, in a possible implementation manner, processing the sentence with stop words filtered and entity labels applied, together with the original sentence, to obtain a second sentence vector includes: inputting the original sentence into a first ERNIE deep pre-training model and outputting a third sub-vector; inputting the sentence with stop words filtered and entity labels applied into a second ERNIE deep pre-training model and outputting a fourth sub-vector; and performing weighted fusion on the third sub-vector and the fourth sub-vector to obtain the second sentence vector.
With reference to the embodiment of the first aspect, in a possible implementation manner, preprocessing the received original sentence includes: performing word segmentation on the input original sentence through a word segmentation tool to obtain a plurality of words; filtering out the words that belong to a preset stop-word bank to obtain the sentence with stop words filtered; inputting the sentence with stop words filtered into an entity label recognition model and outputting the entity label of each word included in that sentence; and recombining the words, including their entity labels, in the original order to obtain the sentence with stop words filtered and entity labels applied.
In a second aspect, an embodiment of the present application provides a matching method, where the method includes: processing the obtained question according to the sentence vector generation method in any embodiment of the first aspect to generate a target vector of the question; calculating the similarity between the target vector of the question and the target vector of the question in a database, and determining a target question according to the similarity; each question stored in the database has a corresponding answer; and returning an answer corresponding to the target question.
When the intelligent question-answering system matches an answer to the question input by the user through the matching method provided by this embodiment, the vectors used for comparison carry far more information than the vectors adopted in the prior art. The vectors in this embodiment therefore represent the meaning expressed by a sentence more accurately, which improves the accuracy of the results output by the intelligent question-answering system.
With reference to the second aspect, in a possible implementation manner, the calculating a similarity between the target vector of the question and the target vector of the question in the database includes: calculating the similarity between the target vector of the question and the target vector of the candidate question in the database; wherein the candidate problem is determined by: generating an initial vector of the question according to a preset rule; and performing similarity matching on the initial vector and the initial vector of the problem stored in the database to obtain the candidate problem.
In this embodiment, the similarity is calculated only between the target vector of the question and the target vectors of the candidate questions in the database, which reduces the amount of calculation and improves subsequent processing efficiency.
With reference to the embodiment of the second aspect, in one possible implementation manner, sub databases corresponding to different intents are stored in the database; before the similarity matching the initial vector with the initial vector of the question stored in the database, the method further comprises: performing intention identification on the question, and determining the intention of the question; correspondingly, the similarity matching of the initial vector and the initial vector of the question stored in the database includes: and performing similarity matching on the initial vector and the initial vector of the problem in the sub-database corresponding to the intention in the database.
In this embodiment, the initial vector of the question is similarity-matched only against the initial vectors of the questions in the sub-database corresponding to the intent of the question, which reduces the number of questions to be matched and improves subsequent processing efficiency.
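The intent-based narrowing described in this implementation can be sketched as follows. The sub-database layout, the dot-product scoring, and the threshold value are hypothetical assumptions for illustration; the embodiment specifies only that matching is restricted to the sub-database of the recognized intent.

```python
def candidate_questions(initial_vec, intent, sub_databases, threshold=0.5):
    """Match the initial vector only against the sub-database for the
    recognized intent; questions scoring above the threshold become candidates.
    (Data layout and scoring function are illustrative assumptions.)"""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    sub_db = sub_databases.get(intent, {})
    return [q for q, vec in sub_db.items() if dot(initial_vec, vec) > threshold]

# Toy sub-databases keyed by intent (hypothetical data).
sub_dbs = {
    "billing": {"how to top up": [1.0, 0.0], "check balance": [0.0, 1.0]},
    "packages": {"apply for package": [0.5, 0.5]},
}
print(candidate_questions([0.9, 0.1], "billing", sub_dbs))  # ['how to top up']
```

Because only one sub-database is scanned, the number of similarity computations scales with the size of the intent's sub-database rather than the whole database.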
In a third aspect, an embodiment of the present application provides a sentence vector generating apparatus, where the apparatus includes: the device comprises a preprocessing module and a generating module.
The preprocessing module is used for preprocessing the received original sentences to obtain sentences with filtered stop words and labeled entity labels;
a generating module, configured to process the sentence with the filtered stop word to obtain a first sentence vector, where the first sentence vector includes attribute information of characters included in the sentence with the filtered stop word; the attribute information comprises at least one of a font, a pronunciation and a stroke;
the generating module is further configured to process the sentences obtained by filtering the stop words and labeling the entity tags and the original sentences to obtain second sentence vectors, where the second sentence vectors include information of the entity tags;
the target vectors of the original sentence include the first sentence vector and the second sentence vector.
With reference to the embodiment of the third aspect, in a possible implementation manner, when the attribute information includes glyph, pronunciation, and stroke, the generating module is configured to: input the sentence with stop words filtered into a Tianzige (character-grid) CNN model and output a first sub-vector; input the sentence with stop words filtered into a stroke-pinyin model and output a second sub-vector; and perform weighted fusion on the first sub-vector and the second sub-vector to obtain the first sentence vector.
With reference to the embodiment of the third aspect, in a possible implementation manner, the generating module is configured to: input the original sentence into a first ERNIE deep pre-training model and output a third sub-vector; input the sentence with stop words filtered and entity labels applied into a second ERNIE deep pre-training model and output a fourth sub-vector; and perform weighted fusion on the third sub-vector and the fourth sub-vector to obtain the second sentence vector.
With reference to the third aspect, in a possible implementation manner, the preprocessing module is configured to: perform word segmentation on the input original sentence through a word segmentation tool to obtain a plurality of words; filter out the words that belong to a preset stop-word bank to obtain the sentence with stop words filtered; input the sentence with stop words filtered into an entity label recognition model and output the entity label of each word included in that sentence; and recombine the words, including their entity labels, in the original order to obtain the sentence with stop words filtered and entity labels applied.
In a fourth aspect, an embodiment of the present application further provides a matching apparatus, including: the device comprises a generating module, a calculating module and a returning module.
The generating module is used for processing the obtained question according to a sentence vector generating method to generate a target vector of the question;
the calculation module is used for calculating the similarity between the target vector of the question and the target vector of the question in the database and determining the target question according to the similarity; each question stored in the database has a corresponding answer;
and the return module is used for returning the answer corresponding to the target question.
With reference to the embodiment of the fourth aspect, in one possible implementation manner, the calculation module is configured to: calculate the similarity between the target vector of the question and the target vectors of the candidate questions in the database; wherein a candidate question is determined by: generating an initial vector of the question according to a preset rule; and performing similarity matching between the initial vector and the initial vectors of the questions stored in the database to obtain the candidate questions.
With reference to the embodiment of the fourth aspect, in one possible implementation manner, sub-databases corresponding to different intents are stored in the database; the apparatus further comprises:
a determining module, configured to perform intent recognition on the question and determine the intent of the question;
and a matching module, configured to perform similarity matching between the initial vector and the initial vectors of the questions in the sub-database corresponding to the intent.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment; or to carry out the method of the second aspect embodiment described above and/or as provided in connection with any one of the possible implementations of the second aspect embodiment.
In a sixth aspect, embodiments of the present application further provide a non-volatile computer-readable storage medium (hereinafter, referred to as a storage medium), on which a computer program is stored, where the computer program is executed by a computer to perform the method in the foregoing first aspect and/or any possible implementation manner of the first aspect; or to carry out the above-described embodiments of the second aspect and/or the methods provided in connection with any one of the possible implementations of the embodiments of the second aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort. The foregoing and other objects, features, and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
Fig. 1 is a flowchart illustrating a sentence vector generation method provided in an embodiment of the present application;
fig. 2 is a flowchart illustrating a matching method provided in an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a sentence vector generation apparatus according to an embodiment of the present application;
fig. 4 is a block diagram illustrating a matching apparatus according to an embodiment of the present disclosure;
fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Icon: 100-electronic device; 110-processor; 120-memory; 400-sentence vector generation apparatus; 410-preprocessing module; 420-generation module; 500-matching apparatus; 510-generation module; 520-calculation module; 530-return module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
In addition, the drawbacks of the prior-art sentence vector generation methods (the sentence vectors cover little effective information, resulting in lower accuracy of the results output by the intelligent question-answering system) were identified by the applicant through practice and careful study. Therefore, the discovery process of these drawbacks and the solutions proposed for them in the following embodiments should be regarded as contributions of the applicant to the present application.
In order to solve the above problems, embodiments of the present application provide a sentence vector generation method and apparatus, a matching method and apparatus, and a storage medium, where the sentence vector generated by the vector generation method can cover more effective information, and is beneficial to improving the accuracy of a result output by an intelligent question-answering system.
The technology described herein can be realized by software, hardware, or a combination of the two. Embodiments of the present application are described in detail below, beginning with the sentence vector generation method provided in the present application.
Referring to fig. 1, an embodiment of the present application provides a sentence vector generation method, which may include the following steps.
Step S110: preprocessing the received original sentence to obtain the sentence with stop words filtered out and entity labels applied.
In the embodiment of the application, the obtained original sentence can be preprocessed, stop words in the original sentence are filtered, and entity labels are added to the sentence after the stop words are filtered.
Stop words are function words in natural language that generally carry no actual meaning; most are modal particles, such as the Chinese particles "了" (le) and "呢" (ne).
In some embodiments, the stop word library may be constructed according to actual requirements, and various words and/or phrases for representing stop words are included in the stop word library.
During preprocessing, a word segmentation tool (e.g., Stanford, HanLP, or jieba) may be used to segment the input original sentence into the words it contains.
For example, the original sentence "畅爽冰激凌套餐怎么办理申请呢?" ("How do I apply for the Changshuang ice-cream package?") is segmented into: "how", "handle", "apply for", "Changshuang ice-cream package", "呢", "?".
And filtering words belonging to the stop word library aiming at a plurality of words included in the original sentence to obtain the sentence with the stop words filtered.
For example, filtering stop words from "how", "handle", "apply for", "Changshuang ice-cream package", "呢", "?" yields: "how", "handle", "apply for", "Changshuang ice-cream package", "?".
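The segmentation-plus-filtering step above can be sketched as follows. The token list and the stop-word bank are hypothetical stand-ins; in practice a segmenter such as jieba would produce the tokens, and the bank would be built from the preset stop-word library described above.

```python
# Sketch of stop-word filtering on an already-segmented sentence.
# Tokens and stop words below are toy examples, not the patent's data.
STOP_WORD_BANK = {"ne", "ma", "le"}  # stand-ins for modal particles such as 呢/吗/了

def filter_stop_words(tokens):
    """Remove tokens that appear in the stop-word bank, preserving order."""
    return [t for t in tokens if t not in STOP_WORD_BANK]

tokens = ["how", "handle", "apply-for", "ice-cream-package", "ne", "?"]
print(filter_stop_words(tokens))
# ['how', 'handle', 'apply-for', 'ice-cream-package', '?']
```

Order preservation matters here, because the filtered words are later recombined in their original sequence.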
An entity label denotes the named-entity category of a word, i.e., it qualifies the category to which the word belongs.
In the embodiment of the present application, a basic model (e.g., a Hidden Markov Model (HMM), a conditional random field model (CRF), etc.) may be trained in advance to obtain an entity label recognition model.
The samples used for training the basic model are individual sentences in which the entity label of each word has been manually annotated. When a word consists of several characters, its first character is given the entity label B-X (where X denotes a specific sub-label), its last character is given the entity label E-X, and the characters in between are given the entity label M-X. Here, B stands for "Begin", M for "Middle", and E for "End".
In the embodiment of the application, after the sentence with the filtered stop word is obtained, the sentence with the filtered stop word can be input into the entity tag recognition model, and the entity tag of each word included in the sentence with the filtered stop word is output.
For example, for "how", "handle", "apply for", "Changshuang ice-cream package", "?", after the sentence is input into the entity label recognition model, "畅爽冰激凌套餐" (the Changshuang ice-cream package) is recognized as an entity, and the recognition result is: "畅/B-UnicomPackage 爽/M-UnicomPackage 冰/M-UnicomPackage 激/M-UnicomPackage 凌/M-UnicomPackage 套/M-UnicomPackage 餐/E-UnicomPackage", where UnicomPackage is the sub-label used to represent the package name.
For another example, for "how", "top up", "call charge", "?", after input into the entity label recognition model, "话费" (call charge) is recognized as an entity, and the recognition result is: "话/B-Business 费/E-Business", where Business is the sub-label used to represent a service.
After the entity labels of the words are obtained, the words, including their entity labels, are recombined in the original order to obtain the sentence with stop words filtered and entity labels applied.
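The per-character B/M/E labeling and recombination described above can be sketched minimally. The tag format `char/TAG-SubLabel` follows the examples above; the handling of single-character words is an assumption, since the text only specifies the multi-character case.

```python
def bme_tag_word(word, sub_label):
    """Label each character of a recognized entity word with B-/M-/E- tags,
    per the scheme above (B = Begin, M = Middle, E = End)."""
    chars = list(word)
    if len(chars) == 1:
        # Single-character words are not specified in the text; B- is an assumption.
        return [f"{chars[0]}/B-{sub_label}"]
    tags = ["B"] + ["M"] * (len(chars) - 2) + ["E"]
    return [f"{c}/{t}-{sub_label}" for c, t in zip(chars, tags)]

def recombine(tagged_words):
    """Rejoin the tagged words in their original order into one labeled sentence."""
    return " ".join(tag for word in tagged_words for tag in word)

print(bme_tag_word("abc", "UnicomPackage"))
# ['a/B-UnicomPackage', 'b/M-UnicomPackage', 'c/E-UnicomPackage']
```

The same scheme applied to a seven-character package name produces one B- tag, five M- tags, and one E- tag, matching the example above.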
Step S120: processing the sentence with stop words filtered to obtain a first sentence vector.
In the embodiment of the present application, the sentences from which stop words are filtered may be processed to obtain the first sentence vectors corresponding to the sentences. The first sentence vector includes attribute information of characters included in the sentence after the stop word is filtered.
The attribute information of the text may include at least one of font, pronunciation, and stroke information.
The font information of the characters mainly refers to the font information of the Chinese characters.
In this embodiment, each Chinese character in the sentence may be converted into an ancient character form (e.g., bronze inscription, seal script, or traditional Chinese) according to a Tianzige (character-grid) model; CNN (Convolutional Neural Network) image processing is then used to extract features from the pixels of the ancient character form to obtain word vectors, which are combined into the sentence vector of the sentence.
The structure of a Chinese character can be decomposed into components and radicals; characters that share a component often express related meanings and may even share a pronunciation. Components and radicals can in turn be decomposed into linear combinations of the basic strokes: horizontal, vertical, left-falling, right-falling, and turning strokes. The pronunciation and stroke information of a character is thus mainly used to express the association between a character's pronunciation and its structure, and the linear combination of its strokes.
In the embodiment of the application, a word vector can be formed based on strokes and pinyin of each Chinese character in a sentence according to the stroke pinyin model, and the word vectors are combined to obtain a sentence vector of the sentence.
The sample used for training the stroke pinyin model is a sentence, and each word in the sentence is labeled with the corresponding stroke and pinyin in advance.
In some embodiments, when the attribute information includes a font, a pronunciation, and a stroke, processing the sentence after filtering the stop word to obtain a first sentence vector may include:
inputting the sentence with stop words filtered into the Tianzige (character-grid) CNN model and outputting a first sub-vector; inputting the sentence with stop words filtered into the stroke-pinyin model and outputting a second sub-vector; and performing weighted fusion on the first sub-vector and the second sub-vector to obtain the first sentence vector.
For example, if the weight of the first sub-vector is 0.6 and the weight of the second sub-vector is 0.4, the fusion is: first sub-vector × 0.6 + second sub-vector × 0.4 = first sentence vector.
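The weighted fusion just described is a simple element-wise weighted sum. A sketch follows; the 0.6/0.4 weights come from the example above, while the vector values are toy data.

```python
def weighted_fuse(vec_a, vec_b, w_a=0.6, w_b=0.4):
    """Element-wise weighted sum of two equal-length sub-vectors:
    first sentence vector = w_a * first sub-vector + w_b * second sub-vector."""
    if len(vec_a) != len(vec_b):
        raise ValueError("sub-vectors must have the same dimension")
    return [w_a * a + w_b * b for a, b in zip(vec_a, vec_b)]

first_sub = [1.0, 0.0, 0.5]   # toy output of the Tianzige-CNN model
second_sub = [0.0, 1.0, 0.5]  # toy output of the stroke-pinyin model
fused = weighted_fuse(first_sub, second_sub)
print(fused)  # approximately [0.6, 0.4, 0.5]
```

The same function, with weights 0.2 and 0.8, also covers the fusion of the third and fourth sub-vectors described in step S130.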
Generally, Chinese characters are formed in six ways (the liushu): pictographic, self-indicative, compound-indicative, phono-semantic, derivative, and phonetic-loan characters. The first sentence vector obtained above contains the glyph, pronunciation, and stroke information of each Chinese character in the sentence after stop words are filtered, thereby covering these character-formation principles, so that the final first sentence vector carries a large amount of information about the characters.
Step S130: processing the sentence with stop words filtered and entity labels applied, together with the original sentence, to obtain a second sentence vector.
The second sentence vector comprises the information of the entity labels in the sentence after stop words are filtered and entity labels are applied.
In some embodiments, the original sentence may be input into a first ERNIE deep pre-training model, which outputs a third sub-vector, and the sentence with stop words filtered and entity labels applied may be input into a second ERNIE deep pre-training model, which outputs a fourth sub-vector.
And performing weighted fusion on the third subvector and the fourth subvector to obtain a second sentence vector.
For example, if the weight of the third sub-vector is 0.2 and the weight of the fourth sub-vector is 0.8, the fusion is: third sub-vector × 0.2 + fourth sub-vector × 0.8 = second sentence vector.
As for the ERNIE deep pre-training model, it is a mature prior art and is not described herein again.
After the first sentence vector and the second sentence vector are obtained, the target vector of the original sentence can be represented by the first sentence vector and the second sentence vector, that is, the target vector of the original sentence includes the first sentence vector and the second sentence vector.
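The composition of the target vector from its two sentence vectors can be sketched minimally. Representing the target vector as a concatenation is an assumption for illustration: the text states only that the target vector includes both sentence vectors, not how they are combined.

```python
def target_vector(first_sentence_vec, second_sentence_vec):
    """Compose the target vector of the original sentence from the first and
    second sentence vectors. Concatenation is one possible realization
    (an assumption; the text only says the target vector includes both)."""
    return list(first_sentence_vec) + list(second_sentence_vec)

print(target_vector([0.6, 0.4], [0.2, 0.8]))  # [0.6, 0.4, 0.2, 0.8]
```

Keeping the two parts as an ordered pair instead of concatenating would work equally well, as long as similarity is computed consistently on both sides.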
As can be seen from the above process, the target vector obtained by processing the original sentence in this embodiment contains the complete sentence information of the original sentence (the third sub-vector), the entity label information of the original sentence (the fourth sub-vector), and the character-formation information of the Chinese characters remaining after stop words are removed (the first and/or second sub-vector). Compared with the prior-art approach of extracting a sentence vector by text feature extraction, the target vector generated by this scheme contains more information and can better express the information in the original sentence.
In addition, referring to fig. 2, an embodiment of the present application further provides a matching method applied to an intelligent question answering system, where the method may include the following steps:
step S210: and processing the obtained question according to a sentence vector generation method to generate a target vector of the question.
When a user has a consultation need, the question can be entered, by keyboard or voice input, in a dialog box provided by the intelligent question-answering system and submitted to the system's processing unit.
After the processing unit of the intelligent question-answering system obtains the question, it may generate the target vector corresponding to the question, referred to herein as target vector A, according to the sentence vector generation method described above.
As can be seen from the above description, the target vector includes two vectors, namely a first sentence vector and a second sentence vector, referred to as A1 and A2 respectively.
Step S220: and calculating the similarity between the target vector of the question and the target vector of the question in the database, and determining the target question according to the similarity.
Step S230: and returning an answer corresponding to the target question.
It should be noted that, in the embodiment of the present application, a large number of questions are stored in the local or cloud database of the intelligent question-answering system, and each question has a corresponding answer. The questions and their corresponding answers are saved in the database in advance by the developer.
In the embodiment of the present application, the target question can be determined from the questions located in the database by calculating the similarity between the target vector a of the question and the target vector of each question in the database (referred to as target vector B).
Here, the target question is a question whose target vector B has a similarity with the target vector A of the question sentence that exceeds a first similarity threshold, e.g., 80%.
After the target question is obtained, the intelligent question-answering system can return the answer corresponding to the target question from the database and display it on a display device built into or externally connected to the intelligent question-answering system.
Of course, it is worth pointing out that when the target vectors B of multiple questions all have a similarity with the target vector A of the question exceeding the first similarity threshold, in one embodiment only the question with the highest similarity may be determined as the target question, and the answer corresponding to that target question returned.
In another embodiment, all questions exceeding the first similarity threshold may be determined as target questions, and answers corresponding to all the target questions may be returned.
Of course, for each question in the database, the corresponding target vector B is also generated according to the sentence vector generation method described above, and it likewise includes a first sentence vector and a second sentence vector, abbreviated as B1 and B2 respectively.
In some embodiments, the target vector B corresponding to each question may be generated and persisted in a database upon initialization of the intelligent question-answering system; in other embodiments, the target vector B corresponding to each question may also be generated in real time when it is desired to use the target vector B.
As mentioned above, the target vector A of the question includes A1 and A2, and the target vector B of a question in the database includes B1 and B2. When calculating the similarity between target vector A and a target vector B, the similarity between A1 and B1 may be calculated to obtain a first similarity, and the similarity between A2 and B2 may be calculated to obtain a second similarity; the first similarity and the second similarity are then weighted and summed, and the weighted sum is used as the similarity between target vector A and target vector B.
The weights of the first similarity and the second similarity are configured in advance by the developer; for example, in one embodiment, the weight of the first similarity is 0.5 and the weight of the second similarity is 0.5. Of course, these weights may be adjusted according to the actual situation.
The similarity between two vectors can be calculated using the cosine formula: cos θ = (a · b) / (||a|| ||b||), where a is one of the two vectors (e.g., A1), b is the other of the two vectors (e.g., B1), ||a|| denotes the modulus of vector a, ||b|| denotes the modulus of vector b, and the value of cos θ is the similarity between vector a and vector b.
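The cosine similarity, and the weighted combination of the first and second similarities described above, can be sketched as follows (a non-authoritative illustration; the 0.5/0.5 weights are the example values from the text):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def target_similarity(a1: np.ndarray, b1: np.ndarray,
                      a2: np.ndarray, b2: np.ndarray,
                      w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted sum of the A1/B1 similarity and the A2/B2 similarity."""
    return w1 * cosine_similarity(a1, b1) + w2 * cosine_similarity(a2, b2)
```

Identical vectors yield a similarity of 1.0, and orthogonal vectors yield 0.0, which matches the usual behaviour of cosine similarity.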
Furthermore, in some embodiments, before step S220 is performed, the questions in the database may be screened to select candidate questions.
In some embodiments, candidate questions may be determined as follows.
Initial vectors corresponding to the respective questions (generated by text feature extraction, for example word2vec, tfidf, or word2vec + tfidf) are stored in the database in advance.
After the question is obtained, an initial vector of the question is generated according to a preset rule, such as the same text feature extraction method. The initial vector is then similarity-matched against the initial vectors of the questions in the database, and questions whose similarity exceeds a second similarity threshold (e.g., 50%) are determined as candidate questions.
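A sketch of this candidate-screening step, assuming the stored initial vectors are kept in a dict keyed by question id (the 0.5 threshold is the example value above; the vectors and ids are illustrative):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen_candidates(query_vec: np.ndarray, initial_vectors: dict,
                      threshold: float = 0.5) -> list:
    """Return ids of questions whose initial vector clears the second threshold."""
    return [qid for qid, vec in initial_vectors.items()
            if cosine(query_vec, vec) >= threshold]

db = {"q1": np.array([1.0, 0.0]), "q2": np.array([0.0, 1.0])}
candidates = screen_candidates(np.array([1.0, 0.1]), db)
```

Only the surviving candidates then go through the more expensive target-vector comparison of step S220.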
After the candidate questions are obtained, correspondingly, in step S220 the similarity may be calculated only between the target vector A of the question and the target vectors B of the candidate questions, thereby reducing the amount of calculation and improving processing efficiency.
In addition, in some embodiments, the intelligent question-answering system can answer related questions in different fields or in different scenes, and accordingly, sub-databases corresponding to different intentions are stored in the database.
To reduce the subsequent amount of calculation and improve processing efficiency, before similarity-matching the initial vector against the initial vectors of the questions stored in the database, intention recognition may be performed on the question input by the user to determine its intention.
The intention recognition mainly detects keywords in the question or the sentence pattern of the question.
Optionally, if the question is detected to include a preset keyword, the intention of the question is determined to be the intention corresponding to the matched keyword; if the question follows a preset sentence pattern, the intention of the question may be determined to be the intention corresponding to the matched sentence pattern.
For example, for the question "i want to handle number portability service", the preset keyword "number portability" is matched, so the intention of the question is the intention corresponding to "number portability", namely "service handling-number portability".
Similarly, for the question "how to handle number portability service", if the preset sentence pattern "how to XXXX number portability" is matched, the intention of the question is the intention corresponding to that sentence pattern, namely "service handling-number portability".
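The keyword-first, pattern-second matching described above could look like the following sketch; the keyword table and regular expression are made-up stand-ins for the system's preset keywords and sentence patterns:

```python
import re

# Hypothetical intent tables; a real system would load these from configuration.
KEYWORD_INTENTS = {"number portability": "service handling-number portability"}
PATTERN_INTENTS = [
    (re.compile(r"how to .*number portability"),
     "service handling-number portability"),
]

def detect_intent(question: str):
    """Match preset keywords first, then preset sentence patterns."""
    for keyword, intent in KEYWORD_INTENTS.items():
        if keyword in question:
            return intent
    for pattern, intent in PATTERN_INTENTS:
        if pattern.search(question):
            return intent
    return None  # neither matched: fall back to the multi-label classifier
```

Returning `None` here corresponds to the fallback case described next, where a multi-label classification model is used.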
If the intention of the question is not matched in either of the above two ways, in some embodiments the sentence obtained by preprocessing the question (i.e., with stop words filtered and entity tags labeled) may be input into a multi-label classification model (Text-CNN), which outputs the intention corresponding to the entity tags contained in the question.
The multi-label classification model is a mature prior art, and is not described herein again.
After the intention of the question is obtained, when the initial vector is similarity-matched against the initial vectors of the questions stored in the database, the matching may be restricted to the initial vectors of the questions in the sub-database corresponding to that intention, thereby reducing the number of questions that need to be matched.
When the intelligent question-answering system matches answers to questions input by the user through the matching method provided by the embodiments of the present application, the vectors used for comparison carry far more information than those used in the prior art. The vectors in the embodiments of the present application can therefore represent the meaning of a sentence more accurately, which improves the accuracy of the results output by the intelligent question-answering system.
As shown in fig. 3, an embodiment of the present application further provides a sentence vector generating apparatus 400, where the sentence vector generating apparatus 400 may include: a preprocessing module 410 and a generation module 420.
A preprocessing module 410, configured to preprocess the received original sentence to obtain a sentence with the stop word filtered and the entity tag labeled;
a generating module 420, configured to process the sentence with the filtered stop word to obtain a first sentence vector, where the first sentence vector includes attribute information of characters included in the sentence with the filtered stop word; the attribute information comprises at least one of a font, a pronunciation and a stroke;
the generating module 420 is further configured to process the sentence with the filtering stop word and the entity tag labeled and the original sentence to obtain a second sentence vector, where the second sentence vector includes information of the entity tag;
the target vectors of the original sentence include the first sentence vector and the second sentence vector.
In a possible implementation, when the attribute information includes a font, a pronunciation, and a stroke, the generating module 420 is configured to: input the sentence with stop words filtered into a field character grid-CNN model to output a first sub-vector; input the sentence with stop words filtered into a stroke pinyin model to output a second sub-vector; and perform weighted fusion on the first sub-vector and the second sub-vector to obtain the first sentence vector.
In a possible implementation, the generating module 420 is configured to: input the original sentence into a first ERNIE deep pre-training model to output a third sub-vector; input the sentence with stop words filtered and entity tags labeled into a second ERNIE deep pre-training model to output a fourth sub-vector; and perform weighted fusion on the third sub-vector and the fourth sub-vector to obtain the second sentence vector.
In a possible implementation, the preprocessing module 410 is configured to: perform word segmentation processing on the input original sentence through a word segmentation tool to obtain a plurality of words; filter the words belonging to a preset stop-word library from the plurality of words to obtain a sentence with stop words filtered; input the sentence with stop words filtered into an entity tag recognition model to output entity tags for all the words it includes; and recombine all the words, together with their entity tags, in the original order to obtain the sentence with stop words filtered and entity tags labeled.
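The four preprocessing steps above (word segmentation, stop-word filtering, entity labeling, recombination) can be sketched as below; the whitespace tokenizer and the all-"O" tagger are crude stand-ins for the word segmentation tool and the entity tag recognition model named in the text:

```python
STOP_WORDS = {"the", "a", "to"}  # stand-in for the preset stop-word library

def segment(sentence: str) -> list:
    """Stand-in word segmentation tool (a Chinese system would use a real segmenter)."""
    return sentence.split()

def tag_entities(words: list) -> list:
    """Stand-in entity tag recognition model: tags every word 'O' (no entity)."""
    return [(w, "O") for w in words]

def preprocess(sentence: str) -> str:
    words = segment(sentence)                              # step 1: segmentation
    filtered = [w for w in words if w not in STOP_WORDS]   # step 2: stop-word filter
    tagged = tag_entities(filtered)                        # step 3: entity labeling
    # step 4: recombine the tagged words in their original order
    return " ".join(f"{word}/{tag}" for word, tag in tagged)
```

The output format (word/tag pairs in original order) is one plausible encoding of "a sentence with stop words filtered and entity tags labeled", not the patent's mandated representation.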
The sentence vector generating apparatus 400 provided in the embodiment of the present application has the same implementation principle and the same technical effect as those of the foregoing method embodiments, and for the sake of brief description, no mention is made in the apparatus embodiment, and reference may be made to the corresponding contents in the foregoing method embodiments.
As shown in fig. 4, an embodiment of the present application further provides a matching apparatus 500, where the matching apparatus 500 may include: a generation module 510, a calculation module 520, and a return module 530.
A generating module 510, configured to process the obtained question according to a sentence vector generating method, and generate a target vector of the question;
a calculating module 520, configured to calculate similarity between the target vector of the question and a target vector of a question in a database, and determine a target question according to the similarity; each question stored in the database has a corresponding answer;
a returning module 530, configured to return an answer corresponding to the target question.
In a possible implementation, the calculating module 520 is configured to:
calculating the similarity between the target vector of the question and the target vector of the candidate question in the database;
wherein the candidate problem is determined by:
generating an initial vector of the question according to a preset rule;
and performing similarity matching on the initial vector and the initial vector of the problem stored in the database to obtain the candidate problem.
In one possible embodiment, sub-databases corresponding to different intents are stored in the database; the device further comprises:
the determining module is used for identifying the intention of the question and determining the intention of the question;
and the matching module is used for matching the initial vector with the similarity of the initial vectors of the problems in the sub-database corresponding to the intention in the database.
The matching apparatus 500 provided in the embodiment of the present application has the same implementation principle and the same technical effect as those of the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments for parts of the apparatus embodiments that are not mentioned.
In addition, the present application further provides a storage medium, where a computer program is stored, and when the computer program is executed by a computer, the steps included in the matching method described above are executed.
In addition, referring to fig. 5, an embodiment of the present application further provides an electronic device 100 for implementing a matching method and apparatus and/or a sentence vector generating method and apparatus in the embodiment of the present application.
Alternatively, the electronic device 100 may be, but is not limited to, a personal computer (PC), a smart phone, a tablet computer, a mobile Internet device (MID), a personal digital assistant (PDA), a server, and the like. The server may be, but is not limited to, a web server, a database server, a cloud server, and the like.
Among them, the electronic device 100 may include: a processor 110, a memory 120.
It should be noted that the components and structure of electronic device 100 shown in FIG. 5 are exemplary only, and not limiting, and electronic device 100 may have other components and structures as desired.
The processor 110, memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the processor 110, the memory 120, and other components that may be present may be electrically coupled to each other via one or more communication buses or signal lines.
The memory 120 is used for storing programs, such as the programs corresponding to the matching method or sentence vector generation method mentioned above, or to the matching apparatus or sentence vector generating apparatus mentioned above. Optionally, when the matching apparatus or sentence vector generating apparatus is stored in the memory 120, it includes at least one software functional module that can be stored in the memory 120 in the form of software or firmware.
Alternatively, the software function module included in the matching device or the sentence vector generating device may also be solidified in an Operating System (OS) of the electronic device 100.
The processor 110 is adapted to execute an executable module stored in the memory 120, such as a software functional module or a computer program included in the matching means or sentence vector generating means. When the processor 110 receives the execution instruction, it may execute the computer program, for example, to perform: preprocessing the received original sentence to obtain a sentence with the stop words filtered and the entity labels labeled; processing the sentences with the filtered stop words to obtain a first sentence vector, wherein the first sentence vector comprises attribute information of characters included in the sentences with the filtered stop words; the attribute information comprises at least one of a font, a pronunciation and a stroke; processing the sentences after the stop words are filtered and the entity labels are labeled and the original sentences to obtain second sentence vectors, wherein the second sentence vectors comprise information of the entity labels; the target vectors of the original sentence include the first sentence vector and the second sentence vector.
Or performing: processing the obtained question according to a sentence vector generation method to generate a target vector of the question; calculating the similarity between the target vector of the question and the target vector of the question in a database, and determining a target question according to the similarity; each question stored in the database has a corresponding answer; and returning an answer corresponding to the target question.
Of course, the method disclosed in any of the embodiments of the present application can be applied to the processor 110, or implemented by the processor 110.
In summary, according to the sentence vector generation method and apparatus, the matching method and apparatus, and the storage medium provided in the embodiments of the present application, the target vector obtained by processing the original sentence includes the entity-tag information in the original sentence and the attribute information of the Chinese characters in the original sentence with stop words removed.
In addition, when the intelligent question-answering system matches answers to questions input by the user through the matching method provided by the embodiments of the present application, the vectors used for comparison carry far more information than those used in the prior art, so the vectors in the embodiments of the present application can represent the meaning of a sentence more accurately, improving the accuracy of the results output by the intelligent question-answering system.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions may be stored in a storage medium if they are implemented in the form of software function modules and sold or used as independent products. Based on such understanding, the technical solutions of the present application, or portions thereof, may be substantially or partially embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.
Claims (10)
1. A method of generating a sentence vector, the method comprising:
preprocessing the received original sentence to obtain a sentence with the filtered stop words and labeled entity labels;
processing the sentences with the filtered stop words to obtain a first sentence vector, wherein the first sentence vector comprises attribute information of characters included in the sentences with the filtered stop words; the attribute information comprises at least one of a font, a pronunciation and a stroke;
processing the sentences after the stop words are filtered and the entity labels are labeled and the original sentences to obtain second sentence vectors, wherein the second sentence vectors comprise information of the entity labels;
the target vectors of the original sentence include the first sentence vector and the second sentence vector.
2. The method of claim 1, wherein when the attribute information includes a font style, a pronunciation, and a stroke, the processing the sentence after the stop word is filtered to obtain a first sentence vector comprises:
inputting the sentence with the stop words filtered into a field character grid-CNN model, and outputting to obtain a first subvector;
inputting the sentence with the filtered stop word into a stroke pinyin model, and outputting to obtain a second subvector;
and performing weighted fusion on the first sub-vector and the second sub-vector to obtain the first sentence vector.
3. The method of claim 1, wherein the processing the sentence after the stop word is filtered and the entity tag is labeled and the original sentence to obtain a second sentence vector comprises:
inputting the original sentence into a first ERNIE deep pre-training model, and outputting to obtain a third sub-vector;
inputting the sentence with stop words filtered and entity labels labeled into a second ERNIE deep pre-training model, and outputting to obtain a fourth subvector;
and performing weighted fusion on the third sub-vector and the fourth sub-vector to obtain the second sentence vector.
4. The method according to any one of claims 1-3, wherein the pre-processing the received original sentence comprises:
performing word segmentation processing on the original sentence input through a word segmentation tool to obtain a plurality of words;
filtering words in the plurality of words which belong to a preset stop word bank to obtain sentences of which stop words are filtered;
inputting the sentences with the filtered stop words into an entity tag identification model, and outputting to obtain entity tags of all words included in the sentences with the filtered stop words;
and recombining all words including the entity labels according to the original sequence to obtain the sentences of the filtering stop words and the labeled entity labels.
5. A method of matching, the method comprising:
processing the obtained question according to the sentence vector generation method of any one of claims 1 to 4 to generate a target vector of the question;
calculating the similarity between the target vector of the question and the target vector of the question in a database, and determining a target question according to the similarity; each question stored in the database has a corresponding answer;
and returning an answer corresponding to the target question.
6. The method of claim 5, wherein calculating the similarity between the target vector of the question and the target vector of the question in the database comprises:
calculating the similarity between the target vector of the question and the target vector of the candidate question in the database;
wherein the candidate problem is determined by:
generating an initial vector of the question sentence according to a preset rule;
and performing similarity matching on the initial vector and the initial vector of the problem stored in the database to obtain the candidate problem.
7. The method of claim 6, wherein sub-databases corresponding to different intents are maintained in the database; before the similarity matching the initial vector with the initial vector of the question stored in the database, the method further comprises:
performing intention identification on the question, and determining the intention of the question;
correspondingly, the similarity matching of the initial vector and the initial vector of the question stored in the database includes:
and performing similarity matching on the initial vector and the initial vector of the problem in the sub-database corresponding to the intention in the database.
8. An apparatus for generating a sentence vector, the apparatus comprising:
the preprocessing module is used for preprocessing the received original sentences to obtain sentences filtered with stop words and labeled with entity labels;
a generating module, configured to process the sentence with the filtered stop word to obtain a first sentence vector, where the first sentence vector includes attribute information of characters included in the sentence with the filtered stop word; the attribute information comprises at least one of a font, a pronunciation and a stroke;
the generating module is further configured to process the sentences obtained by filtering the stop words and labeling the entity tags and the original sentences to obtain second sentence vectors, where the second sentence vectors include information of the entity tags;
the target vectors of the original sentence include the first sentence vector and the second sentence vector.
9. A matching device, characterized in that the device comprises:
a generating module, configured to process the obtained question according to the sentence vector generating method of any one of claims 1 to 4, and generate a target vector of the question;
the calculation module is used for calculating the similarity between the target vector of the question and the target vector of the question in the database and determining the target question according to the similarity; each question stored in the database has a corresponding answer;
and the returning module is used for returning the answer corresponding to the target question.
10. A storage medium, having stored thereon a computer program which, when executed by a computer, performs the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110955229.0A CN115712704A (en) | 2021-08-19 | 2021-08-19 | Sentence vector generating method and device, matching method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115712704A true CN115712704A (en) | 2023-02-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |