CN113449074A - Sentence vector similarity matching optimization method and device containing proper nouns and storage medium - Google Patents

Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Info

Publication number
CN113449074A
CN113449074A
Authority
CN
China
Prior art keywords
professional
vector
word
words
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110690386.3A
Other languages
Chinese (zh)
Inventor
张丹
陈浩
陈璟
段朋
蔡春茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202110690386.3A
Publication of CN113449074A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a device and a storage medium for optimizing the similarity matching of sentence vectors that contain proper nouns. The invention uses two word vector models: on top of the strong coverage of general-domain words provided by a large-scale open-source word vector model, a small-scale professional word vector model is used to strengthen semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small.

Description

Sentence vector similarity matching optimization method and device containing proper nouns and storage medium
Technical Field
The invention relates to natural language processing, and in particular to a method for optimizing sentence vector similarity in a specialized (proprietary) domain.
Background
AI services are now widely applied across industries. Each industry or manufacturer has its own professional or proprietary terminology, so natural language processing systems can run into problems when handling consultations about topics in these professional fields.
For example, to improve vehicle intelligence, resolve users' car-usage questions in real time and offload part of the manual customer service work, Changan introduced an intelligent robot into its vehicle remote-control software. Since the intelligent robot Xiao An went online, the cumulative number of consultations has exceeded 6 million, and user consultations approach 30,000 per month. Although an automatic training function has been added to the robot's algorithm, the actual accuracy rate has remained around 70%, and some consultations about professional topics still receive irrelevant answers.
The algorithm principle generally adopted by manufacturers for such intelligent robots is to calculate the similarity between the user's question and the entries of a knowledge base and return the answer of the best match. The similarity calculation follows the logic: word segmentation → word vector lookup → sentence vector calculation → similarity calculation. Because traditional language models lack the professional terms of a given industry or manufacturer, such as the automobile industry or a specific car maker, the word vectors of those terms cannot be obtained; in the sentence vector calculation, a keyword whose word vector cannot be obtained is set to the zero vector, which is equivalent to the word being missing. In professional-field consultations these keywords are exactly the professional terms, so the key words of the sentence are lost and the accuracy rate cannot be high no matter how the calculation method is improved. A better way to improve word-vector-based sentence similarity is therefore to start by optimizing the word vectors.
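A minimal sketch of this conventional pipeline may make the failure mode concrete; the function names, the use of jieba for word segmentation and of a gensim-style KeyedVectors object are illustrative assumptions, not details taken from this patent.
    import numpy as np
    import jieba  # a typical Chinese word segmentation library (assumed here, not mandated by the patent)

    def sentence_vector(text, kv, stopwords=frozenset()):
        """Segment the text, look up each word in the KeyedVectors-like object kv,
        and sum the word vectors. A word missing from kv contributes a zero vector,
        i.e. an out-of-vocabulary professional term is effectively dropped."""
        vec = np.zeros(kv.vector_size)
        for w in jieba.lcut(text):
            if w in stopwords:
                continue
            if w in kv:          # OOV professional terms fail this test and are lost
                vec += kv[w]
        return vec

    def cosine(a, b):
        """Cosine similarity between two sentence vectors."""
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0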
Current word vector optimization methods either optimize the algorithm itself or add professional vocabulary to the training corpus in order to train a single word vector model that covers both professional terms and the general domain. However, training such a model requires a corpus on the order of hundreds of millions of tokens, and an individual manufacturer, such as a car company, does not possess such a huge volume of general-domain data. There is therefore a need in the art for an optimization method that can improve the word vector model with far less data.
Disclosure of Invention
In view of the above, the present invention provides a similarity matching optimization method for sentence vectors (sets of word vectors) containing proper nouns. Used together with an open-source general-domain word vector model (such as the Tencent word vector model), it can accurately identify the professional terms in a user consultation even when the professional-term corpus is small, and can convert those terms into corresponding word vectors, thereby improving the accuracy of sentence recognition and, in turn, user satisfaction.
The technical scheme of the invention is as follows:
The invention provides an optimization method for the similarity calculation of sentence vectors (word vector sets) containing proper nouns, comprising the following specific steps:
Step 1: collect and organize the product manuals of the industry or manufacturer, previous user consultation questions and similar material; this corpus contains a large number of the industry's or manufacturer's professional terms.
Step 2: feed the organized corpus into an open-source word vector algorithm (such as Word2Vec or BERT), set the vector dimension (100 is sufficient because the professional vocabulary is not large), the training window size, the learning rate and other related parameters, train the professional-domain word vector model with the algorithm and save it as model_pro. The related parameters (such as the dimension) can be set dynamically according to the corpus size and the computing power available for the professional domain; for specific parameter settings, refer to the guide documentation of the chosen algorithm.
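As an illustration of step 2, the professional-domain model could be trained with gensim's Word2Vec roughly as follows; the corpus file name, the jieba preprocessing and the exact hyper-parameter values are assumptions made only for this sketch.
    import jieba
    from gensim.models import Word2Vec

    # One document per line; the file name is illustrative.
    with open("professional_corpus.txt", encoding="utf-8") as f:
        sentences = [jieba.lcut(line.strip()) for line in f]

    model_pro = Word2Vec(
        sentences,
        vector_size=100,   # a small dimension suffices for a limited professional vocabulary
        window=5,          # training window size
        min_count=1,       # keep rare professional terms
        sg=1,              # skip-gram variant; CBOW would also work
        epochs=10,
    )
    model_pro.wv.save("model_pro.kv")   # persist only the word vectors for later querying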
Step 3: acquire the knowledge points of the industry's or manufacturer's knowledge base and perform the necessary preprocessing on their original texts, namely word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 4: for the knowledge point obtained after word segmentation and stop-word removal, query the word vector of each word in the professional word vector model_pro and add the word vectors together to obtain the sentence vector a11 under the professional model.
Step 5: for the same knowledge point, query the word vector of each word in the general-domain word vector model_gen and add the word vectors together to obtain the sentence vector a21 under the general model.
Step 6: repeat steps 3, 4 and 5 for every knowledge point to obtain the professional sentence vector set A1 and the general sentence vector set A2 of all knowledge points.
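Steps 3 to 6 then amount to building the two sentence-vector sets, as in the following sketch; knowledge_points, stopwords and the two already-loaded models pro_kv and gen_kv are assumed to exist, and sentence_vector() is the helper sketched in the Background section.
    # knowledge_points: list of knowledge-base texts; stopwords: set of stop words (both assumed).
    # pro_kv / gen_kv: professional and general word vector models, e.g. gensim KeyedVectors objects.
    A1 = [sentence_vector(kp, pro_kv, stopwords) for kp in knowledge_points]   # a11 vectors -> set A1
    A2 = [sentence_vector(kp, gen_kv, stopwords) for kp in knowledge_points]   # a21 vectors -> set A2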
Step 7: obtain the user consultation content to be processed and perform the same preprocessing on its original text, namely word segmentation to obtain individual words and stop-word removal to discard unimportant words.
Step 8: for the keyword set obtained after word segmentation and stop-word removal of the user consultation content, query the word vector of each word in the professional word vector model_pro and add the word vectors together to obtain the sentence vector b1 under the professional model.
Step 9: calculate the similarity between b1 and every vector in the set A1 to obtain the n most similar knowledge points, denoted C, and the corresponding similarities P1 = [p10, p11, …, p1n]. The value of n is determined according to the scale of the actual knowledge base.
Step 10: for the same keyword set of the user consultation content, query the word vector of each word in the general word vector model_gen and add the word vectors together to obtain the sentence vector b2 under the general model.
Step 11: retrieve from the set A2 the sentence vectors corresponding to the n most similar knowledge points found in step 9.
Step 12: calculate the similarity between b2 and each of the n sentence vectors from step 11 to obtain the similarities P2 = [p20, p21, …, p2n].
Step 13: assign a similarity weight λ to the professional word vector model and calculate the final similarity P = λ·P1 + (1 − λ)·P2.
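Steps 7 to 13 can be sketched end to end as follows, reusing the sentence_vector() and cosine() helpers from the earlier sketch; the values of n and of the weight λ (lam) are illustrative, not values prescribed by the invention.
    import numpy as np

    def match(query, pro_kv, gen_kv, A1, A2, knowledge_points,
              n=5, lam=0.6, stopwords=frozenset()):
        """Return the n candidate knowledge points ranked by the fused similarity
        P = lam * P1 + (1 - lam) * P2 (steps 7 to 13)."""
        b1 = sentence_vector(query, pro_kv, stopwords)        # step 8: sentence vector under model_pro
        b2 = sentence_vector(query, gen_kv, stopwords)        # step 10: sentence vector under model_gen

        p1_all = np.array([cosine(b1, a) for a in A1])        # step 9: similarity under model_pro
        top = np.argsort(-p1_all)[:n]                         # indices of the n most similar knowledge points

        p2 = np.array([cosine(b2, A2[i]) for i in top])       # steps 11-12: similarity under model_gen
        p = lam * p1_all[top] + (1 - lam) * p2                # step 13: weighted fusion

        order = np.argsort(-p)
        return [(knowledge_points[top[i]], float(p[i])) for i in order]
The returned list pairs each candidate knowledge point with its fused similarity, from which the question-answering logic can pick one or more best answers.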
The invention also provides a device for optimizing the similarity of sentence vectors containing proper nouns, comprising a memory and a processor, wherein the memory stores instructions that enable the processor to execute the above sentence vector similarity optimization method for sentences containing proper nouns.
The present invention further provides a machine-readable storage medium having stored thereon instructions for enabling a machine to perform the above-described proper noun-containing sentence vector similarity optimization method.
The invention uses two word vector models: on top of the strong coverage of general-domain words provided by a large-scale open-source word vector model, a small-scale professional word vector model is used to strengthen semantic recognition, greatly improving the semantic understanding of professional terms even when the professional-domain corpus is small. The specific advantages are as follows:
1. The professional word vector model trained by this method solves the problem that, when sentence vectors are computed with a general word vector model alone, professional terms (such as product model names, function names and the names of specialized parts) are lost and the returned answer does not match the question. For example, with Changan terminology, a user asks 'how do I use auto park on the cs75plus'; after word segmentation and stop-word removal the result is 'cs75plus / auto park / use'. Because cs75plus has no corresponding word vector in the general Tencent word vector model, it is discarded during the similarity calculation, and the answer pushed to the user turns out to be the automatic parking method of a different model, so the question is not actually answered.
2. The method can train the professional word vector model from only the small-scale professional-domain corpus of an industry or manufacturer. Because the method performs the word vector calculation twice and computes the sentence similarity as a weighted sum of the professional and general similarities, coverage of general-domain words does not need to be considered when training the professional word vector model. For example, Changan's technical terms mainly consist of model names, function names and the names of specialized parts, a vocabulary many orders of magnitude smaller than that of the general domain, so a word vector model of Changan-specific terms can be trained from Changan's existing corpus, such as owner's manuals, the Changan knowledge base and the customer service center's consultation records. Conversely, if a user asks 'how about having hot pot tonight', the preprocessed keyword set is 'today / night / eat / hot pot'; none of these keywords has a vector in the Changan professional word vector model, so the question-answering logic can hand the question to a third-party chit-chat interface (a sketch follows this paragraph), and consultations outside the professional domain are still handled normally.
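The routing idea in the example above can be sketched as follows; jieba and the third-party chit-chat service are assumptions outside the scope of this patent, and the heuristic simply checks whether any keyword of the query is covered by the professional word vector model.
    import jieba

    def is_professional(query, pro_kv, stopwords=frozenset()):
        """Treat the query as in-domain only if at least one keyword has a
        vector in the professional word vector model pro_kv."""
        return any(w in pro_kv for w in jieba.lcut(query) if w not in stopwords)

    # Example: a chit-chat question such as 'how about having hot pot tonight' contains no
    # Changan-specific term, so is_professional() returns False and the dialogue logic can
    # forward it to a third-party chit-chat interface (that interface is not part of this patent).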
Drawings
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
The invention is further described below by taking the Changan intelligent robot as an example, with reference to the accompanying drawing:
the intelligent robot of Changan car, the small security, is a car assistant carried in the Changan remote assistant Incall APP, and is mainly used for solving the puzzles of users in the car using process and providing professional answers.
Example 1:
referring to fig. 1, the sentence vector similarity calculation optimization method used in the smart robot security comprises the following steps:
Step 1: collect and organize the Changan professional corpus, which mainly comprises the owner's manuals of the various Changan models, the Changan knowledge base, questions and feedback raised by Changan owners together with their solutions, and similar material, and arrange all of it into documents.
Step 2: train a Changan professional-term word vector model. In practice, the professional-term documents organized in step 1 can be fed into Word2Vec or another open-source word vector algorithm to obtain the Changan professional word vector model_pro.
Step 3: preprocess the knowledge points of the Changan knowledge base. The preprocessing mainly comprises word segmentation and stop-word removal to obtain the keyword groups of the knowledge points.
Step 4: query the Changan professional word vector model to calculate the sentence vector a11 of the knowledge point. Specifically, look up the word vector of each keyword in the keyword group from step 3 in the Changan professional word vector model, then add the word vectors together to obtain the sentence vector.
Step 5: query a general word vector model to calculate the sentence vector a21 of the knowledge point. In practice the general model may be the word vector model provided by Tencent; look up the word vector of each keyword in the keyword group from step 3 in the general word vector model, then add all the vectors to obtain the sentence vector.
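For steps 4 and 5, the two models could be loaded with gensim roughly as below; the Tencent embedding file name and its word2vec text format are assumptions based on the publicly released general-domain embeddings rather than details given in this patent.
    from gensim.models import KeyedVectors

    pro_kv = KeyedVectors.load("model_pro.kv")     # Changan professional model trained in step 2

    # Publicly released general-domain Chinese embeddings (e.g. from Tencent AI Lab) are commonly
    # distributed in word2vec text format; the exact file name below is an assumption.
    gen_kv = KeyedVectors.load_word2vec_format(
        "Tencent_AILab_ChineseEmbedding.txt", binary=False)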
Step 6: repeat steps 3, 4 and 5 until the sentence vectors of all knowledge points in the Changan knowledge base have been computed, obtaining the professional sentence vector set A1 and the general sentence vector set A2 of all knowledge points.
Step 7: the robot Xiao An receives the user consultation and preprocesses its content; the specific method is the same as in step 3.
Step 8: query the Changan professional word vector model to calculate the sentence vector b1 of the user consultation; the specific method is the same as in step 4.
Step 9: query the general word vector model to calculate the sentence vector b2 of the user consultation; the specific method is the same as in step 5.
Step 10: select the n vectors in A1 with the highest similarity to b1 to obtain the knowledge point set C and the corresponding similarity set P1. Specifically, calculate the cosine similarity between b1 and every vector in A1, sort the similarities in descending order, take the top n and record them as P1 = [p10, p11, …, p1n], and record the corresponding knowledge points as the set C. The value of n can be chosen according to the scale of the actual knowledge base.
Step 11: calculate the similarities between b2 and the vectors in A2 that correspond to the knowledge points of the set C = [c0, c1, …, cn], and record them as the set P2 = [p20, p21, …, p2n].
Step 12: calculate the final similarity between the user consultation content and each knowledge point in the knowledge base. Specifically, assign the similarity weight λ to the professional word vector set and compute the weighted sum of the two similarities, so that the final similarity between the user consultation content and the knowledge point c0 is:
p0=λ*p10+(1-λ)*p20
Loop over all knowledge points in C to obtain the final similarity set P = [p0, p1, …, pn]. The value of λ can be adjusted according to the actual situation. Sort the similarities in P in descending order; the robot then selects one or more of the closest knowledge points for the user according to the subsequent logic.
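As a purely illustrative calculation, with λ = 0.6, p10 = 0.9 and p20 = 0.5, the fused similarity of the first candidate would be p0 = 0.6 × 0.9 + 0.4 × 0.5 = 0.74.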
Example 2:
a sentence vector similarity optimization device containing proper nouns comprises a memory and a processor, wherein the memory stores instructions used for enabling the processor to execute the sentence vector similarity optimization method containing proper nouns.
Example 3:
a machine-readable storage medium having stored thereon instructions for enabling a machine to perform the above-described proper noun-containing sentence vector similarity optimization method.

Claims (6)

1. A sentence vector similarity optimization method for sentences containing proper nouns, comprising the following steps:
step 1: organizing a corpus from the product specifications of an industry or manufacturer and the users' existing consultation questions, the corpus containing the professional vocabulary of the industry or manufacturer;
step 2: feeding the organized corpus into an open-source word vector algorithm model and setting related parameters such as the vector dimension, training window size and learning rate to obtain the professional-domain word vector model_pro;
step 3: acquiring the knowledge points of the industry's or manufacturer's knowledge base and preprocessing their original texts;
step 4: for a preprocessed knowledge point, querying the word vector of each word in the professional word vector model_pro and adding the word vectors together to obtain the sentence vector a11 under the professional model;
step 5: for the same preprocessed knowledge point, querying the word vector of each word in the general-domain word vector model_gen and adding the word vectors together to obtain the sentence vector a21 under the general model;
step 6: repeating steps 3, 4 and 5 to obtain the sentence vectors of all knowledge points under the professional model, namely the professional vector set A1, and under the general model, namely the general vector set A2;
step 7: acquiring the user consultation content to be processed and preprocessing its original text;
step 8: for the preprocessed user consultation content, querying the word vector of each word in the professional word vector model_pro and adding the word vectors together to obtain the sentence vector b1 under the professional model;
step 9: calculating the similarity between b1 and every vector in the set A1 to obtain the n most similar knowledge points C and the corresponding similarities P1 = [p10, p11, …, p1n];
step 10: for the preprocessed user consultation content, querying the word vector of each word in the general word vector model_gen and adding the word vectors together to obtain the sentence vector b2 under the general model;
step 11: retrieving from the set A2 the sentence vectors corresponding to the n most similar knowledge points found in step 9;
step 12: calculating the similarity between b2 and each of the n sentence vectors from step 11 to obtain the similarities P2 = [p20, p21, …, p2n];
step 13: assigning a similarity weight λ to the professional word vector model and calculating the final similarity P = λ·P1 + (1 − λ)·P2.
2. The method of claim 1, wherein the preprocessing comprises word segmentation to obtain individual words and stop-word removal to discard unimportant words.
3. The method as claimed in claim 1, wherein the parameters in step 2 are dynamically set according to the corpus size and the computation power requirement of the domain.
4. The method of claim 1, wherein the value of n is determined according to the scale of the actual knowledge base.
5. A sentence vector similarity optimization device containing proper nouns, characterized in that the device comprises a memory and a processor, wherein the memory stores instructions for enabling the processor to execute the sentence vector similarity optimization method containing proper nouns according to any one of claims 1 to 4.
6. A machine-readable storage medium having stored thereon instructions for enabling a machine to execute the proper noun-containing sentence vector similarity optimization method according to any one of claims 1-4.
CN202110690386.3A 2021-06-22 2021-06-22 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium Withdrawn CN113449074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110690386.3A CN113449074A (en) 2021-06-22 2021-06-22 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Publications (1)

Publication Number Publication Date
CN113449074A true CN113449074A (en) 2021-09-28

Family

ID=77812114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110690386.3A Withdrawn CN113449074A (en) 2021-06-22 2021-06-22 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Country Status (1)

Country Link
CN (1) CN113449074A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009064051A (en) * 2007-09-04 2009-03-26 National Institute Of Information & Communication Technology Information processor, information processing method and program
US20120203539A1 (en) * 2011-02-08 2012-08-09 Microsoft Corporation Selection of domain-adapted translation subcorpora
CN108304439A (en) * 2017-10-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of semantic model optimization method, device and smart machine, storage medium
CN109359302A (en) * 2018-10-26 2019-02-19 重庆大学 A kind of optimization method of field term vector and fusion sort method based on it
CN109960815A (en) * 2019-03-27 2019-07-02 河南大学 A kind of creation method and system of nerve machine translation NMT model
CN111222327A (en) * 2019-12-23 2020-06-02 东软集团股份有限公司 Word embedding representation method, device and equipment
CN111539197A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Text matching method and device, computer system and readable storage medium

Similar Documents

Publication Publication Date Title
CN110162611A (en) A kind of intelligent customer service answer method and system
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111062220B (en) End-to-end intention recognition system and method based on memory forgetting device
CN112035640A (en) Refined question-answering method based on intelligent question-answering robot, storage medium and intelligent equipment
CN111858854B (en) Question-answer matching method and relevant device based on historical dialogue information
CN110321564B (en) Multi-round dialogue intention recognition method
CN111611382A (en) Dialect model training method, dialog information generation method, device and system
CN106844344B (en) Contribution calculation method for conversation and theme extraction method and system
CN110597966A (en) Automatic question answering method and device
CN111309891B (en) System for reading robot to automatically ask and answer questions and application method thereof
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN114970560A (en) Dialog intention recognition method and device, storage medium and intelligent device
CN113590778A (en) Intelligent customer service intention understanding method, device, equipment and storage medium
CN114678014A (en) Intention recognition method, device, computer equipment and computer readable storage medium
CN111738018A (en) Intention understanding method, device, equipment and storage medium
CN113342958A (en) Question-answer matching method, text matching model training method and related equipment
CN113704444A (en) Question-answering method, system, equipment and storage medium based on natural language processing
CN114238373A (en) Method and device for converting natural language question into structured query statement
CN113297365B (en) User intention judging method, device, equipment and storage medium
CN115345177A (en) Intention recognition model training method and dialogue method and device
CN113449074A (en) Sentence vector similarity matching optimization method and device containing proper nouns and storage medium
CN114510561A (en) Answer selection method, device, equipment and storage medium
CN114186048A (en) Question-answer replying method and device based on artificial intelligence, computer equipment and medium
CN114116975A (en) Multi-intention identification method and system
CN114722830A (en) Intelligent customer service semantic recognition general model construction method and question-answering robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210928