CN113449074A - Sentence vector similarity matching optimization method and device containing proper nouns and storage medium
- Publication number
- CN113449074A (application CN202110690386.3A)
- Authority
- CN
- China
- Prior art keywords
- professional
- vector
- word
- words
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method, a device and a storage medium for optimizing sentence-vector similarity matching for sentences containing proper nouns. The invention uses two word vector models: on top of the strong general-domain coverage of a large-scale open-source word vector model, a small-scale professional word vector model enhances semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small.
Description
Technical Field
The invention relates to natural language processing, and in particular to a method for optimizing sentence-vector similarity in specialized domains.
Background
AI services are now widely deployed across industries. Each industry or manufacturer has its own professional or proprietary terminology, and general-purpose natural language processing technology runs into problems when handling consultations on such professional topics.
For example, to raise the level of vehicle intelligence, answer users' car-usage questions in real time, and offload part of the manual customer-service workload, Changan Automobile introduced an intelligent robot into its remote-control software. Since this robot, Xiao An, went online it has accumulated more than 6 million consultations, and users now submit close to 30,000 consultations per month. Although the robot's algorithm includes an automatic training function, its actual accuracy has stayed at about 70%, and some questions in professional fields still receive irrelevant answers.
The intelligent-robot algorithm generally adopted by manufacturers matches a user's question against a knowledge base by similarity in order to find the answer the user needs. The similarity computation follows the pipeline: word segmentation → word-vector lookup → sentence-vector computation → similarity computation. Because traditional language models lack the professional terms of a given industry or manufacturer (for example, the automobile industry or a particular carmaker), no word vector can be obtained for such a term; in the sentence-vector computation, a word whose vector cannot be found is replaced by the zero vector, which is equivalent to dropping the word. Since these missing keywords are precisely the professional terms in a user's domain-specific question, losing them caps the achievable accuracy no matter how the downstream calculation is improved. A better way to improve word-vector-based sentence similarity is therefore to start by optimizing the word vectors themselves.
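The zero-vector fallback described above can be illustrated with a minimal sketch; the toy vector table, its words, and the three-dimensional vectors are illustrative assumptions, not values from the patent:

```python
# Toy stand-in for a general-domain word vector model; the professional
# term "cs75plus" is deliberately absent, as described above.
general_model = {
    "auto park": [0.9, 0.1, 0.0],
    "use":       [0.1, 0.8, 0.2],
}

def sentence_vector(words, model, dim=3):
    """Sum the word vectors; an out-of-vocabulary word contributes a
    zero vector, i.e. it is silently dropped from the sentence."""
    vec = [0.0] * dim
    for w in words:
        wv = model.get(w, [0.0] * dim)  # missing word -> zero vector
        vec = [a + b for a, b in zip(vec, wv)]
    return vec

# Two different questions collapse to the same sentence vector because
# the professional keyword carries no information in the general model.
q1 = sentence_vector(["cs75plus", "auto park", "use"], general_model)
q2 = sentence_vector(["auto park", "use"], general_model)
```

Since `q1 == q2`, no downstream similarity measure can distinguish the two questions, which is exactly the accuracy ceiling attributed above to missing professional terms.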
Current word-vector optimization methods either improve the algorithm itself or add professional vocabulary to the training corpus so as to train a single word vector model covering both professional terms and the general domain. However, training such a model requires a corpus of at least hundreds of millions of tokens, and an individual manufacturer such as a car company does not have such a huge volume of general-domain data. The art therefore needs an optimization method that can improve the word vector model with far less data.
Disclosure of Invention
In view of the above, the present invention provides a similarity-matching optimization method for sentence vectors (word-vector sets) containing proper nouns. Using an open-source general-domain word vector model (such as the Tencent word vector model) together with a small professional-term corpus, it can accurately identify professional terms in user consultations and convert them into corresponding word vectors, improving the accuracy of sentence matching and thus user satisfaction.
The technical scheme of the invention is as follows:
The invention provides an optimization method for similarity calculation of sentence vectors (word-vector sets) containing proper nouns. The specific steps are as follows:
Step 1: Collect and organize corpora such as the industry's or manufacturer's product manuals and users' past consultation questions; these corpora contain a large number of the industry's or manufacturer's professional terms.
Step 2: Feed the organized corpora into an open-source word-vector algorithm (such as Word2Vec or BERT), set the vector dimension (100 is sufficient, since the professional vocabulary is not large), training window size, learning rate and other parameters, train the model, and save it as the professional-domain word vector model, model_pro. The parameters (such as the dimension) can be set dynamically according to the professional corpus size and the available computing power; for specific settings, refer to the documentation of the chosen algorithm.
Step 3: Obtain the knowledge points of the industry's or manufacturer's knowledge base and preprocess their original text, including word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 4: For each knowledge point obtained after segmentation and stop-word removal, look up the vector of each word in the professional word vector model model_pro and sum them to obtain the sentence vector a11 under the professional model.
Step 5: For the same knowledge point, look up the vector of each word in the general-domain word vector model model_gen and sum them to obtain the sentence vector a21 under the general model.
Step 6: Repeat steps 3, 4 and 5 for all knowledge points to obtain the professional sentence-vector set A1 and the general sentence-vector set A2.
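Steps 3 to 6 can be sketched as follows; the two tiny hand-made vector tables stand in for model_pro and model_gen, and all words, dimensions, and numbers are illustrative assumptions:

```python
model_pro = {"cs75plus": [1.0, 0.0], "auto park": [0.0, 1.0]}  # toy professional model
model_gen = {"auto park": [0.2, 0.8], "use": [0.7, 0.3]}       # toy general model

def sent_vec(words, model, dim=2):
    """Sentence vector = sum of word vectors (unknown word -> zero vector)."""
    vec = [0.0] * dim
    for w in words:
        wv = model.get(w, [0.0] * dim)
        vec = [a + b for a, b in zip(vec, wv)]
    return vec

# Each preprocessed knowledge point is encoded twice, once per model.
knowledge_points = [["cs75plus", "auto park", "use"]]
A1 = [sent_vec(kp, model_pro) for kp in knowledge_points]  # professional set A1
A2 = [sent_vec(kp, model_gen) for kp in knowledge_points]  # general set A2
```

Note that the professional model recovers the `cs75plus` signal that the general model drops, while the general model covers the everyday word `use` that the professional model lacks; the two sets complement each other.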
Step 7: Obtain the user consultation content to be processed and preprocess its original text, including word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 8: For the segmented, stop-word-filtered user consultation, look up the vector of each word in the professional word vector model model_pro and sum them to obtain the sentence vector b1 under the professional model.
Step 9: Compute the similarity between b1 and every vector in the set A1 to obtain the n most similar knowledge points, forming set C, and the corresponding similarities P1 = [p10, p11, …, p1n]. The value of n is determined by the size of the actual knowledge base.
Step 10: For the same preprocessed user consultation, look up the vector of each word in the general word vector model model_gen and sum them to obtain the sentence vector b2 under the general model.
Step 11: Obtain the sentence vectors in the set A2 corresponding to the n most similar knowledge points found in step 9.
Step 12: Compute the similarity between b2 and each of the n sentence vectors from step 11 to obtain P2 = [p20, p21, …, p2n].
Step 13: Assign a similarity weight λ to the professional word vector model and compute the final similarity elementwise: P = λ·P1 + (1 − λ)·P2.
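Steps 9 to 13 combine the two similarity lists with the weight λ; a minimal sketch, in which the cosine function and the example similarity values are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def final_similarity(p1, p2, lam):
    """Elementwise P = lam*P1 + (1-lam)*P2 over the n candidate knowledge points."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

P1 = [0.9, 0.5]                        # professional-model similarities (step 9)
P2 = [0.6, 0.8]                        # general-model similarities (step 12)
P = final_similarity(P1, P2, lam=0.5)  # lam is a tunable weight; 0.5 assumed here
```

Raising λ trusts the professional model more; the patent leaves its exact value to be tuned against the actual knowledge base.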
The invention also provides a device for optimizing the similarity of sentence vectors containing proper nouns, comprising a memory and a processor, the memory storing instructions for causing the processor to execute the above method.
The invention further provides a machine-readable storage medium storing instructions for causing a machine to execute the above method.
The invention uses two word vector models: on top of the strong general-domain coverage of a large-scale open-source word vector model, a small-scale professional word vector model enhances semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small. The specific advantages are as follows:
1. The professional word vector model trained by this method solves the problem that, when sentence vectors are computed with a general word vector model alone, professional terms (such as product model names, function names and specialized part names) are lost and the returned answers do not match the question. For example, with Changan terminology, a user asks 'how do I use auto park on the cs75plus'; after segmentation and stop-word removal the result is 'cs75plus / auto park / use'. Since 'cs75plus' has no corresponding vector in the Tencent cloud general word vector model, it is discarded during similarity calculation, and the pushed answer may describe the auto-park procedure of a different model such as the Eado PLUS, an answer that does not match the question.
2. The method can train the word vector model from only a small-scale professional-domain corpus of the industry or manufacturer. Because it computes word vectors twice and scores sentence similarity as a weighted sum of the professional and general similarities, the professional model does not need to cover general-domain words at all. For example, Changan's professional terms are mainly model names, function names and specialized part names, a vocabulary many orders of magnitude smaller than the general domain, so a professional-term model can be trained from Changan's existing corpora such as owner's manuals, the Changan knowledge base and customer-service consultation records. Purely general questions are still handled correctly: if a user asks, say, 'how about eating hot pot tonight', the preprocessed keyword set is 'today / night / eat / hot pot'; none of these keywords is found in the Changan professional word vector model, so the question-answering logic can route the query to a third-party chit-chat interface. Non-professional consultations are thus processed normally.
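The chit-chat fallback described in advantage 2 amounts to a vocabulary check; a sketch, in which the vocabulary contents and routing labels are assumptions for illustration:

```python
pro_vocab = {"cs75plus", "auto park", "remote start"}  # toy professional vocabulary

def route(keywords):
    """If no keyword is known to the professional model, treat the
    question as chit-chat and hand it to a third-party interface."""
    if any(w in pro_vocab for w in keywords):
        return "knowledge-base"
    return "third-party-chat"

r1 = route(["cs75plus", "auto park", "use"])      # professional question
r2 = route(["today", "night", "eat", "hot pot"])  # general chit-chat
```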
Drawings
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
The invention is further described below, taking Changan Automobile's intelligent robot as an example and with reference to the accompanying drawing:
the intelligent robot of Changan car, the small security, is a car assistant carried in the Changan remote assistant Incall APP, and is mainly used for solving the puzzles of users in the car using process and providing professional answers.
Example 1:
Referring to Fig. 1, the sentence-vector similarity optimization method used in the intelligent robot Xiao An comprises the following steps:
Step 1: Collect and organize Changan professional corpora, mainly the owner's manuals of the various Changan models, the Changan knowledge base, and owners' questions together with their feedback and resolutions, and organize all content into documents.
Step 2: Train the Changan professional-term word vector model. In practice, the professional-term documents organized in step 1 are fed into Word2Vec or another open-source word-vector algorithm to obtain the Changan professional word vector model, model_pro.
Step 3: Preprocess the knowledge points of the Changan knowledge base. Preprocessing mainly consists of word segmentation and stop-word removal, yielding each knowledge point's keyword group.
Step 4: Query and compute the knowledge point's sentence vector a11 with the Changan professional word vector model: look up the vector of each keyword in the keyword group from step 3 in the professional model, then sum the word vectors to obtain the sentence vector.
Step 5: Query and compute the knowledge point's sentence vector a21 with a general word vector model. In practice the general model may be the word vector model provided by Tencent: look up each keyword's vector in the general model, then sum all the vectors to obtain the sentence vector.
Step 6: Repeat steps 3, 4 and 5 until the sentence vectors of all knowledge points in the Changan knowledge base have been computed, yielding the professional sentence-vector set A1 and the general sentence-vector set A2.
Step 7: Xiao An receives a user consultation and preprocesses its content, using the same method as step 3.
Step 8: Query and compute the user consultation's sentence vector b1 with the Changan professional word vector model, using the same method as step 4.
Step 9: Query and compute the user consultation's sentence vector b2 with the general word vector model, using the same method as step 5.
Step 10: Select the n vectors in A1 most similar to b1, obtaining the knowledge-point set C and the corresponding similarity set P1. Specifically, compute the cosine similarity between b1 and every vector in A1, sort the similarities in descending order, and take the first n, recorded as P1 = [p10, p11, …, p1n], with the corresponding knowledge points recorded as C. The value of n can be chosen according to the size of the actual knowledge base.
Step 11: Compute the similarities between b2 and the vectors in A2 corresponding to the knowledge points C = [c0, c1, …, cn], recorded as P2 = [p20, p21, …, p2n].
Step 12: Compute the final similarity between the user consultation and each knowledge point: assign the professional similarity a weight λ and take the weighted sum of the two similarities, so that the final similarity between the user consultation and knowledge point c0 is:
p0 = λ·p10 + (1 − λ)·p20
Iterate over all knowledge points in C to obtain the final similarity set P = [p0, p1, …, pn]. The value of λ can be adjusted to the actual situation. Sort the similarities in P in descending order; the robot then selects the closest knowledge point or points for the user according to its subsequent logic.
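The cosine ranking used in step 10 above can be sketched as follows; the candidate vectors and the choice n = 2 are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def top_n(b1, A1, n):
    """Rank all knowledge-point vectors by cosine similarity to the user
    query vector b1, descending, and keep the n best."""
    scored = sorted(((cosine(b1, v), i) for i, v in enumerate(A1)), reverse=True)
    return scored[:n]  # [(similarity, knowledge-point index), ...]

A1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy professional sentence vectors
b1 = [1.0, 0.1]                            # toy user-query sentence vector
best = top_n(b1, A1, n=2)                  # candidates for set C and similarities P1
```

The surviving indices identify the knowledge-point set C, and the paired similarities form P1 for the subsequent weighted re-ranking.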
Example 2:
A device for optimizing the similarity of sentence vectors containing proper nouns, comprising a memory and a processor, wherein the memory stores instructions for causing the processor to execute the above method.
Example 3:
A machine-readable storage medium storing instructions for causing a machine to execute the above method for optimizing the similarity of sentence vectors containing proper nouns.
Claims (6)
1. A similarity optimization method for sentence vectors containing proper nouns, comprising the following steps:
step 1: organizing corpora, such as the industry's or manufacturer's product manuals and users' existing consultation questions, that contain the industry's or manufacturer's professional vocabulary;
step 2: inputting the organized corpora into an open-source word-vector algorithm model and setting parameters such as vector dimension, training window size and learning rate to obtain a professional-domain word vector model, model_pro;
step 3: obtaining knowledge points from the industry's or manufacturer's knowledge base and preprocessing their original text;
step 4: looking up the word vectors of all words of a preprocessed knowledge point in the professional word vector model model_pro and summing them to obtain the sentence vector a11 under the professional model;
step 5: looking up the word vectors of all words of the preprocessed knowledge point in the general-domain word vector model model_gen and summing them to obtain the sentence vector a21 under the general model;
step 6: repeating steps 3, 4 and 5 to obtain the sentence vectors of all knowledge points under the professional model, namely the professional sentence-vector set A1, and under the general model, namely the general sentence-vector set A2;
step 7: obtaining the user consultation content to be processed and preprocessing its original text;
step 8: looking up the word vectors of all words of the preprocessed user consultation in the professional word vector model model_pro and summing them to obtain the sentence vector b1 under the professional model;
step 9: computing the similarity between b1 and all vectors in the set A1 to obtain the n most similar knowledge points, forming set C, and the corresponding similarities P1 = [p10, p11, …, p1n];
step 10: looking up the word vectors of all words of the preprocessed user consultation in the general word vector model model_gen and summing them to obtain the sentence vector b2 under the general model;
step 11: obtaining the sentence vectors in the set A2 corresponding to the n most similar knowledge points computed in step 9;
step 12: computing the similarity between b2 and each of the n sentence vectors from step 11 to obtain P2 = [p20, p21, …, p2n];
step 13: assigning a similarity weight λ to the professional word vector model and computing the final similarity P = λ·P1 + (1 − λ)·P2.
2. The method of claim 1, wherein the preprocessing comprises word segmentation to obtain individual words and stop-word removal to discard unimportant words.
3. The method of claim 1, wherein the parameters in step 2 are set dynamically according to the professional-domain corpus size and the computing-power requirements.
4. The method of claim 1, wherein the value of n is determined according to the scale of the actual knowledge base.
5. A device for optimizing the similarity of sentence vectors containing proper nouns, characterized in that it comprises a memory and a processor, the memory storing instructions for causing the processor to execute the method of any one of claims 1 to 4.
6. A machine-readable storage medium storing instructions for causing a machine to execute the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110690386.3A CN113449074A (en) | 2021-06-22 | 2021-06-22 | Sentence vector similarity matching optimization method and device containing proper nouns and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449074A true CN113449074A (en) | 2021-09-28 |
Family
ID=77812114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110690386.3A Withdrawn CN113449074A (en) | 2021-06-22 | 2021-06-22 | Sentence vector similarity matching optimization method and device containing proper nouns and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449074A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009064051A (en) * | 2007-09-04 | 2009-03-26 | National Institute Of Information & Communication Technology | Information processor, information processing method and program |
US20120203539A1 (en) * | 2011-02-08 | 2012-08-09 | Microsoft Corporation | Selection of domain-adapted translation subcorpora |
CN108304439A (en) * | 2017-10-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of semantic model optimization method, device and smart machine, storage medium |
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
CN109960815A (en) * | 2019-03-27 | 2019-07-02 | 河南大学 | A kind of creation method and system of nerve machine translation NMT model |
CN111222327A (en) * | 2019-12-23 | 2020-06-02 | 东软集团股份有限公司 | Word embedding representation method, device and equipment |
CN111539197A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Text matching method and device, computer system and readable storage medium |
- 2021-06-22: application CN202110690386.3A filed; publication CN113449074A (en); status: not active, Withdrawn
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210928 |