CN113449074A - Sentence vector similarity matching optimization method and device containing proper nouns and storage medium
- Publication number
- CN113449074A (application CN202110690386.3A)
- Authority
- CN
- China
- Prior art keywords
- professional
- vector
- word
- words
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method, a device and a storage medium for optimizing sentence-vector similarity matching for sentences containing proper nouns. The invention uses two word vector models: on top of the strong general-domain coverage of a large-scale open-source word vector model, a small-scale professional word vector model enhances semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small.
Description
Technical Field
The invention relates to natural language processing, and in particular to a method for optimizing sentence-vector similarity in specialized domains.
Background
AI services are now widely deployed across industries. Each industry or manufacturer has its own professional or proprietary terminology, and general-purpose natural language processing technology runs into problems when handling consultations on such professional topics.
For example, to raise the level of vehicle intelligence, answer users' car-usage questions in real time, and offload part of the manual customer-service workload, Changan Automobile introduced an intelligent robot into its remote-control software. Since this robot, Xiao An, went online it has accumulated more than 6 million consultations, and users now submit close to 30,000 consultations per month. Although the robot's algorithm includes an automatic training function, its actual accuracy has stayed at about 70%, and some questions in professional fields still receive irrelevant answers.
The intelligent-robot algorithm generally adopted by manufacturers matches a user's question against a knowledge base by similarity in order to find the answer the user needs. The similarity computation follows the pipeline: word segmentation → word-vector lookup → sentence-vector computation → similarity computation. Because traditional language models lack the professional terms of a given industry or manufacturer (for example, the automobile industry or a particular carmaker), no word vector can be obtained for such a term; in the sentence-vector computation, a word whose vector cannot be found is replaced by the zero vector, which is equivalent to dropping the word. Since these missing keywords are precisely the professional terms in a user's domain-specific question, losing them caps the achievable accuracy no matter how the downstream calculation is improved. A better way to improve word-vector-based sentence similarity is therefore to start by optimizing the word vectors themselves.
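The zero-vector fallback described above can be illustrated with a minimal sketch; the toy vector table, its words, and the three-dimensional vectors are illustrative assumptions, not values from the patent:

```python
# Toy stand-in for a general-domain word vector model; the professional
# term "cs75plus" is deliberately absent, as described above.
general_model = {
    "auto park": [0.9, 0.1, 0.0],
    "use":       [0.1, 0.8, 0.2],
}

def sentence_vector(words, model, dim=3):
    """Sum the word vectors; an out-of-vocabulary word contributes a
    zero vector, i.e. it is silently dropped from the sentence."""
    vec = [0.0] * dim
    for w in words:
        wv = model.get(w, [0.0] * dim)  # missing word -> zero vector
        vec = [a + b for a, b in zip(vec, wv)]
    return vec

# Two different questions collapse to the same sentence vector because
# the professional keyword carries no information in the general model.
q1 = sentence_vector(["cs75plus", "auto park", "use"], general_model)
q2 = sentence_vector(["auto park", "use"], general_model)
```

Since `q1 == q2`, no downstream similarity measure can distinguish the two questions, which is exactly the accuracy ceiling attributed above to missing professional terms.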
Current word-vector optimization methods either improve the algorithm itself or add professional vocabulary to the training corpus so as to train a single word vector model covering both professional terms and the general domain. However, training such a model requires a corpus of at least hundreds of millions of tokens, and an individual manufacturer such as a car company does not have such a huge volume of general-domain data. The art therefore needs an optimization method that can improve the word vector model with far less data.
Disclosure of Invention
In view of the above, the present invention provides a similarity-matching optimization method for sentence vectors (word-vector sets) containing proper nouns. Using an open-source general-domain word vector model (such as the Tencent word vector model) together with a small professional-term corpus, it can accurately identify professional terms in user consultations and convert them into corresponding word vectors, improving the accuracy of sentence matching and thus user satisfaction.
The technical scheme of the invention is as follows:
The invention provides an optimization method for similarity calculation of sentence vectors (word-vector sets) containing proper nouns. The specific steps are as follows:
Step 1: Collect and organize corpora such as the industry's or manufacturer's product manuals and users' past consultation questions; these corpora contain a large number of the industry's or manufacturer's professional terms.
Step 2: Feed the organized corpora into an open-source word-vector algorithm (such as Word2Vec or BERT), set the vector dimension (100 is sufficient, since the professional vocabulary is not large), training window size, learning rate and other parameters, train the model, and save it as the professional-domain word vector model, model_pro. The parameters (such as the dimension) can be set dynamically according to the professional corpus size and the available computing power; for specific settings, refer to the documentation of the chosen algorithm.
Step 3: Obtain the knowledge points of the industry's or manufacturer's knowledge base and preprocess their original text, including word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 4: For each knowledge point obtained after segmentation and stop-word removal, look up the vector of each word in the professional word vector model model_pro and sum them to obtain the sentence vector a11 under the professional model.
Step 5: For the same knowledge point, look up the vector of each word in the general-domain word vector model model_gen and sum them to obtain the sentence vector a21 under the general model.
Step 6: Repeat steps 3, 4 and 5 for all knowledge points to obtain the professional sentence-vector set A1 and the general sentence-vector set A2.
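Steps 3 to 6 can be sketched as follows; the two tiny hand-made vector tables stand in for model_pro and model_gen, and all words, dimensions, and numbers are illustrative assumptions:

```python
model_pro = {"cs75plus": [1.0, 0.0], "auto park": [0.0, 1.0]}  # toy professional model
model_gen = {"auto park": [0.2, 0.8], "use": [0.7, 0.3]}       # toy general model

def sent_vec(words, model, dim=2):
    """Sentence vector = sum of word vectors (unknown word -> zero vector)."""
    vec = [0.0] * dim
    for w in words:
        wv = model.get(w, [0.0] * dim)
        vec = [a + b for a, b in zip(vec, wv)]
    return vec

# Each preprocessed knowledge point is encoded twice, once per model.
knowledge_points = [["cs75plus", "auto park", "use"]]
A1 = [sent_vec(kp, model_pro) for kp in knowledge_points]  # professional set A1
A2 = [sent_vec(kp, model_gen) for kp in knowledge_points]  # general set A2
```

Note that the professional model recovers the `cs75plus` signal that the general model drops, while the general model covers the everyday word `use` that the professional model lacks; the two sets complement each other.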
Step 7: Obtain the user consultation content to be processed and preprocess its original text, including word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 8: For the segmented, stop-word-filtered user consultation, look up the vector of each word in the professional word vector model model_pro and sum them to obtain the sentence vector b1 under the professional model.
Step 9: Compute the similarity between b1 and every vector in the set A1 to obtain the n most similar knowledge points, forming set C, and the corresponding similarities P1 = [p10, p11, …, p1n]. The value of n is determined by the size of the actual knowledge base.
Step 10: For the same preprocessed user consultation, look up the vector of each word in the general word vector model model_gen and sum them to obtain the sentence vector b2 under the general model.
Step 11: Obtain the sentence vectors in the set A2 corresponding to the n most similar knowledge points found in step 9.
Step 12: Compute the similarity between b2 and each of the n sentence vectors from step 11 to obtain P2 = [p20, p21, …, p2n].
Step 13: Assign a similarity weight λ to the professional word vector model and compute the final similarity elementwise: P = λ·P1 + (1 − λ)·P2.
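Steps 9 to 13 combine the two similarity lists with the weight λ; a minimal sketch, in which the cosine function and the example similarity values are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def final_similarity(p1, p2, lam):
    """Elementwise P = lam*P1 + (1-lam)*P2 over the n candidate knowledge points."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

P1 = [0.9, 0.5]                        # professional-model similarities (step 9)
P2 = [0.6, 0.8]                        # general-model similarities (step 12)
P = final_similarity(P1, P2, lam=0.5)  # lam is a tunable weight; 0.5 assumed here
```

Raising λ trusts the professional model more; the patent leaves its exact value to be tuned against the actual knowledge base.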
The invention also provides a device for optimizing the similarity of sentence vectors containing proper nouns, comprising a memory and a processor, the memory storing instructions for causing the processor to execute the above method.
The invention further provides a machine-readable storage medium storing instructions for causing a machine to execute the above method.
The invention uses two word vector models: on top of the strong general-domain coverage of a large-scale open-source word vector model, a small-scale professional word vector model enhances semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small. The specific advantages are as follows:
1. The professional word vector model trained by this method solves the problem that, when sentence vectors are computed with a general word vector model alone, professional terms (such as product model names, function names and specialized part names) are lost and the returned answers do not match the question. For example, with Changan terminology, a user asks 'how do I use auto park on the cs75plus'; after segmentation and stop-word removal the result is 'cs75plus / auto park / use'. Since 'cs75plus' has no corresponding vector in the Tencent cloud general word vector model, it is discarded during similarity calculation, and the pushed answer may describe the auto-park procedure of a different model such as the Eado PLUS, an answer that does not match the question.
2. The method can train the word vector model from only a small-scale professional-domain corpus of the industry or manufacturer. Because it computes word vectors twice and scores sentence similarity as a weighted sum of the professional and general similarities, the professional model does not need to cover general-domain words at all. For example, Changan's professional terms are mainly model names, function names and specialized part names, a vocabulary many orders of magnitude smaller than the general domain, so a professional-term model can be trained from Changan's existing corpora such as owner's manuals, the Changan knowledge base and customer-service consultation records. Purely general questions are still handled correctly: if a user asks, say, 'how about eating hot pot tonight', the preprocessed keyword set is 'today / night / eat / hot pot'; none of these keywords is found in the Changan professional word vector model, so the question-answering logic can route the query to a third-party chit-chat interface. Non-professional consultations are thus processed normally.
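The chit-chat fallback described in advantage 2 amounts to a vocabulary check; a sketch, in which the vocabulary contents and routing labels are assumptions for illustration:

```python
pro_vocab = {"cs75plus", "auto park", "remote start"}  # toy professional vocabulary

def route(keywords):
    """If no keyword is known to the professional model, treat the
    question as chit-chat and hand it to a third-party interface."""
    if any(w in pro_vocab for w in keywords):
        return "knowledge-base"
    return "third-party-chat"

r1 = route(["cs75plus", "auto park", "use"])      # professional question
r2 = route(["today", "night", "eat", "hot pot"])  # general chit-chat
```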
Drawings
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
The invention is further described below, taking Changan Automobile's intelligent robot as an example and with reference to the accompanying drawing:
the intelligent robot of Changan car, the small security, is a car assistant carried in the Changan remote assistant Incall APP, and is mainly used for solving the puzzles of users in the car using process and providing professional answers.
Example 1:
Referring to Fig. 1, the sentence-vector similarity optimization method used in the intelligent robot Xiao An comprises the following steps:
Step 1: Collect and organize Changan professional corpora, mainly the owner's manuals of the various Changan models, the Changan knowledge base, and owners' questions together with their feedback and resolutions, and organize all content into documents.
Step 2: Train the Changan professional-term word vector model. In practice, the professional-term documents organized in step 1 are fed into Word2Vec or another open-source word-vector algorithm to obtain the Changan professional word vector model, model_pro.
Step 3: Preprocess the knowledge points of the Changan knowledge base. Preprocessing mainly consists of word segmentation and stop-word removal, yielding each knowledge point's keyword group.
Step 4: Query and compute the knowledge point's sentence vector a11 with the Changan professional word vector model: look up the vector of each keyword in the keyword group from step 3 in the professional model, then sum the word vectors to obtain the sentence vector.
Step 5: Query and compute the knowledge point's sentence vector a21 with a general word vector model. In practice the general model may be the word vector model provided by Tencent: look up each keyword's vector in the general model, then sum all the vectors to obtain the sentence vector.
Step 6: Repeat steps 3, 4 and 5 until the sentence vectors of all knowledge points in the Changan knowledge base have been computed, yielding the professional sentence-vector set A1 and the general sentence-vector set A2.
Step 7: Xiao An receives a user consultation and preprocesses its content, using the same method as step 3.
Step 8: Query and compute the user consultation's sentence vector b1 with the Changan professional word vector model, using the same method as step 4.
Step 9: Query and compute the user consultation's sentence vector b2 with the general word vector model, using the same method as step 5.
Step 10: Select the n vectors in A1 most similar to b1, obtaining the knowledge-point set C and the corresponding similarity set P1. Specifically, compute the cosine similarity between b1 and every vector in A1, sort the similarities in descending order, and take the first n, recorded as P1 = [p10, p11, …, p1n], with the corresponding knowledge points recorded as C. The value of n can be chosen according to the size of the actual knowledge base.
Step 11: Compute the similarities between b2 and the vectors in A2 corresponding to the knowledge points C = [c0, c1, …, cn], recorded as P2 = [p20, p21, …, p2n].
Step 12: Compute the final similarity between the user consultation and each knowledge point: assign the professional similarity a weight λ and take the weighted sum of the two similarities, so that the final similarity between the user consultation and knowledge point c0 is:
p0 = λ·p10 + (1 − λ)·p20
Iterate over all knowledge points in C to obtain the final similarity set P = [p0, p1, …, pn]. The value of λ can be adjusted to the actual situation. Sort the similarities in P in descending order; the robot then selects the closest knowledge point or points for the user according to its subsequent logic.
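The cosine ranking used in step 10 above can be sketched as follows; the candidate vectors and the choice n = 2 are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def top_n(b1, A1, n):
    """Rank all knowledge-point vectors by cosine similarity to the user
    query vector b1, descending, and keep the n best."""
    scored = sorted(((cosine(b1, v), i) for i, v in enumerate(A1)), reverse=True)
    return scored[:n]  # [(similarity, knowledge-point index), ...]

A1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy professional sentence vectors
b1 = [1.0, 0.1]                            # toy user-query sentence vector
best = top_n(b1, A1, n=2)                  # candidates for set C and similarities P1
```

The surviving indices identify the knowledge-point set C, and the paired similarities form P1 for the subsequent weighted re-ranking.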
Example 2:
A device for optimizing the similarity of sentence vectors containing proper nouns, comprising a memory and a processor, wherein the memory stores instructions for causing the processor to execute the above method.
Example 3:
A machine-readable storage medium storing instructions for causing a machine to execute the above method for optimizing the similarity of sentence vectors containing proper nouns.
Claims (6)
1. A similarity optimization method for sentence vectors containing proper nouns, comprising the following steps:
step 1: organizing corpora, such as the industry's or manufacturer's product manuals and users' existing consultation questions, that contain the industry's or manufacturer's professional vocabulary;
step 2: inputting the organized corpora into an open-source word-vector algorithm model and setting parameters such as vector dimension, training window size and learning rate to obtain a professional-domain word vector model, model_pro;
step 3: obtaining knowledge points from the industry's or manufacturer's knowledge base and preprocessing their original text;
step 4: looking up the word vectors of all words of a preprocessed knowledge point in the professional word vector model model_pro and summing them to obtain the sentence vector a11 under the professional model;
step 5: looking up the word vectors of all words of the preprocessed knowledge point in the general-domain word vector model model_gen and summing them to obtain the sentence vector a21 under the general model;
step 6: repeating steps 3, 4 and 5 to obtain the sentence vectors of all knowledge points under the professional model, namely the professional sentence-vector set A1, and under the general model, namely the general sentence-vector set A2;
step 7: obtaining the user consultation content to be processed and preprocessing its original text;
step 8: looking up the word vectors of all words of the preprocessed user consultation in the professional word vector model model_pro and summing them to obtain the sentence vector b1 under the professional model;
step 9: computing the similarity between b1 and all vectors in the set A1 to obtain the n most similar knowledge points, forming set C, and the corresponding similarities P1 = [p10, p11, …, p1n];
step 10: looking up the word vectors of all words of the preprocessed user consultation in the general word vector model model_gen and summing them to obtain the sentence vector b2 under the general model;
step 11: obtaining the sentence vectors in the set A2 corresponding to the n most similar knowledge points computed in step 9;
step 12: computing the similarity between b2 and each of the n sentence vectors from step 11 to obtain P2 = [p20, p21, …, p2n];
step 13: assigning a similarity weight λ to the professional word vector model and computing the final similarity P = λ·P1 + (1 − λ)·P2.
2. The method of claim 1, wherein the preprocessing comprises word segmentation to obtain individual words and stop-word removal to discard unimportant words.
3. The method of claim 1, wherein the parameters in step 2 are set dynamically according to the professional-domain corpus size and the computing-power requirements.
4. The method of claim 1, wherein the value of n is determined according to the scale of the actual knowledge base.
5. A device for optimizing the similarity of sentence vectors containing proper nouns, characterized in that it comprises a memory and a processor, the memory storing instructions for causing the processor to execute the method of any one of claims 1 to 4.
6. A machine-readable storage medium storing instructions for causing a machine to execute the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110690386.3A CN113449074A (en) | 2021-06-22 | 2021-06-22 | Sentence vector similarity matching optimization method and device containing proper nouns and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449074A true CN113449074A (en) | 2021-09-28 |
Family
ID=77812114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110690386.3A Withdrawn CN113449074A (en) | 2021-06-22 | 2021-06-22 | Sentence vector similarity matching optimization method and device containing proper nouns and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449074A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009064051A (en) * | 2007-09-04 | 2009-03-26 | National Institute Of Information & Communication Technology | Information processor, information processing method and program |
US20120203539A1 (en) * | 2011-02-08 | 2012-08-09 | Microsoft Corporation | Selection of domain-adapted translation subcorpora |
CN108304439A (en) * | 2017-10-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of semantic model optimization method, device and smart machine, storage medium |
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
CN109960815A (en) * | 2019-03-27 | 2019-07-02 | 河南大学 | A kind of creation method and system of nerve machine translation NMT model |
CN111222327A (en) * | 2019-12-23 | 2020-06-02 | 东软集团股份有限公司 | Word embedding representation method, device and equipment |
CN111539197A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Text matching method and device, computer system and readable storage medium |
- 2021-06-22: application CN202110690386.3A filed; publication CN113449074A (en); status: not active, Withdrawn
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210928 |