CN112182145A - Text similarity determination method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112182145A
CN112182145A
Authority
CN
China
Prior art keywords
word
text
similarity
word frequency
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910600981.6A
Other languages
Chinese (zh)
Inventor
王艳花 (Wang Yanhua)
邱龙泉 (Qiu Longquan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910600981.6A
Publication of CN112182145A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention disclose a text similarity determination method, device, equipment and storage medium. The method comprises the following steps: acquiring a target text and an alternative text whose similarity is to be determined; determining the word sense similarity and the word frequency-inverse word frequency similarity of the target text and the alternative text, wherein the word frequency-inverse word frequency similarity comprises a part-of-speech-free word frequency-inverse word frequency similarity and/or a part-of-speech-aware word frequency-inverse word frequency similarity; and determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset part-of-speech-free weight and/or a preset part-of-speech-aware weight. With this technical scheme, the similarity of texts can be determined more accurately.

Description

Text similarity determination method, device, equipment and storage medium
Technical Field
The embodiments of the invention relate to big data mining technology, and in particular to a text similarity determination method, device, equipment and storage medium.
Background
With the development of the internet, more and more data appear on the internet in the form of text, such as microblog messages, news headlines, post content, and, in e-commerce platforms, commodity comments, questions about commodities, and buyers' answers to those questions. Applying machine learning technology to this internet text data in order to mine valuable information from it, so as to bring convenience to people's lives and meet demands in different respects, has become a very hot topic in current big data application technology.
Take the question-and-answer data for goods in an e-commerce platform as an example. When making a purchase decision, a shopper will typically ask a question of people who have purchased the item, browse the question-and-answer data under the item, or ask customer service, in order to fully understand the actual information of the item. From the user's perspective, the user needs to browse historical question-and-answer data based on the question he or she wants to ask and check whether others have asked similar questions; if the number of questions is large, a satisfactory answer may be hard to find. From the customer-service perspective, the question posed by the user needs to be compared for similarity against the existing questions in the question bank, and the several most similar questions found, so that the user's question can be answered with the help of the answers to those similar questions. Thus, it is necessary to calculate the similarity between the user's question and the existing questions in the question bank.
At present, text similarity determination schemes for such questions mainly adopt the vector space model: each word in a text is mapped into a vector space and the cosine distance between vectors is calculated; the smaller the distance, the greater the similarity of the words. There are two main similarity determination schemes based on the vector space model. One is based on word importance, such as the Term Frequency-Inverse Document Frequency (TF-IDF) model. TF-IDF is used to evaluate the importance of a word to a document in a text corpus: the importance of the word increases in proportion to its number of occurrences in the document, but decreases in inverse proportion to its frequency of occurrence in the corpus. The flow of the TF-IDF based similarity determination (i.e., question matching) scheme is shown in fig. 1. A typical application of this scheme is in a search engine: a user inputs a search question, the question is segmented into words, a mapping relation between the word segments and documents is established, and after all matching documents are found, the similarity scores between the user's question and the existing documents in the knowledge base are calculated and ranked according to a similarity algorithm (i.e., the TF-IDF algorithm), with results returned in ranking order. The other is based on Natural Language Processing (NLP), for which the common model is the Word2vec model. The Word2vec model is trained on Harris' distributional hypothesis (i.e., words with similar contexts have similar semantics); the neural network obtained by training can map each word into a K-dimensional vector space, and the semantic similarity of words is then measured by the similarity between their vectors in that space. The flow of the NLP-based semantic similarity scheme is shown in fig. 2.
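To make the TF-IDF branch concrete, the following is a minimal sketch of TF-IDF vectorization plus cosine similarity over pre-segmented questions. The tokens are illustrative, and a sklearn-style smoothed IDF is used so that terms occurring in every document keep a small positive weight; this is an assumed formulation, not the patent's exact one.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF vector per pre-tokenized document.

    Uses smoothed IDF: log((1+n)/(1+df)) + 1, so terms present in
    every document still receive a small positive weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [(tf[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "question bank" of segmented questions (illustrative English tokens)
docs = [
    ["does", "this", "phone", "support", "dual", "sim"],
    ["does", "this", "phone", "support", "fast", "charging"],
    ["what", "color", "is", "the", "case"],
]
_, vecs = tfidf_vectors(docs)
```

As expected, the first two questions (sharing most tokens) score far higher against each other than against the third, disjoint question.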
In the process of implementing the invention, the inventors found at least the following problems in the prior art: 1) The TF-IDF based similarity determination scheme uses only the word frequency-inverse word frequency to measure the importance of a word to a document, and considers neither the position information nor the semantic information of the word; the accuracy of the similarity it determines between texts is therefore low and cannot meet the practical requirement of finding the most similar text in a question-answering system. 2) The NLP-based similarity determination scheme works well only for similarity calculation at the word level; when it is extended to the sentence level, because of the complexity of syntactic structure, simply representing a sentence vector by accumulating or concatenating the word vectors of all words in the sentence cannot achieve an ideal effect in practical application.
Disclosure of Invention
The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, so that the text similarity can be determined more accurately.
In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:
acquiring a target text and an alternative text whose similarity is to be determined;
determining the word sense similarity and the word frequency-inverse word frequency similarity of the target text and the alternative text, wherein the word frequency-inverse word frequency similarity comprises a part-of-speech-free word frequency-inverse word frequency similarity and/or a part-of-speech-aware word frequency-inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset part-of-speech-free weight and/or a preset part-of-speech-aware weight.
In a second aspect, an embodiment of the present invention further provides a text similarity determination apparatus, where the apparatus includes:
a target text acquisition module, configured to acquire a target text and an alternative text whose similarity is to be determined;
a first similarity determination module, configured to determine the word sense similarity and the word frequency-inverse word frequency similarity of the target text and the alternative text, where the word frequency-inverse word frequency similarity comprises a part-of-speech-free word frequency-inverse word frequency similarity and/or a part-of-speech-aware word frequency-inverse word frequency similarity; and
a second similarity determination module, configured to determine the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, where the preset word frequency-inverse word frequency weight comprises a preset part-of-speech-free weight and/or a preset part-of-speech-aware weight.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity determination method provided in any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text similarity determining method provided in any embodiment of the present invention.
According to the technical scheme of the embodiments, the word sense similarity between the target text and the alternative text is generated, together with the word frequency-inverse word frequency similarity, which comprises a part-of-speech-free word frequency-inverse word frequency similarity and/or a part-of-speech-aware word frequency-inverse word frequency similarity; the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset word frequency-inverse word frequency weight (comprising a preset part-of-speech-free weight and/or a preset part-of-speech-aware weight), the word sense similarity and the word frequency-inverse word frequency similarity. By using multiple dimensional characteristics of word sense, word frequency and/or part of speech, the whole-sentence semantics of a text can be grasped, so that text similarity is represented from different dimensions, a comprehensive similarity of the texts is obtained, and the accuracy of the text similarity is improved to a great extent.
Drawings
FIG. 1 is a flow chart of a similarity determination method based on a word-frequency inverse word-frequency model in the prior art;
FIG. 2 is a flow chart of a similarity determination method based on a semantic similarity model of NLP in the prior art;
fig. 3a is a flowchart of a text similarity determining method according to a first embodiment of the present invention;
fig. 3b is a logic framework diagram of a text similarity determination method in the first embodiment of the present invention;
fig. 4a is a flowchart of a text similarity determining method in the second embodiment of the present invention;
FIG. 4b is a schematic diagram of the structure of the CBOW model according to the second embodiment of the present invention;
FIG. 4c is a schematic diagram of the Skip-Gram model structure in the second embodiment of the present invention;
fig. 5 is a flowchart of a text similarity determination method in the third embodiment of the present invention;
fig. 6a is a flowchart of a text similarity determining method in the fourth embodiment of the present invention;
fig. 6b is a logic framework diagram of a text similarity determination method in the fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text similarity determination apparatus according to a fifth embodiment of the present invention;
fig. 8 is a schematic structural diagram of an apparatus in the sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
The text similarity determination method provided by this embodiment is applicable to text similarity calculation for topics, messages and replies, consultations, suggestions and opinion feedback in internet forums, as well as network intelligent question answering, instant chat records, and the like. The method may be performed by a text similarity determination apparatus, which may be implemented by software and/or hardware and may be integrated in a device with large-scale data computation capability, such as a personal computer or a server. In the embodiments of the present invention, intelligent question answering is taken as the example. Referring to fig. 3a, the method of this embodiment specifically includes the following steps:
and S110, acquiring a target text and an alternative text with the similarity to be determined.
The target text is a text for which similarity needs to be calculated, and may be an existing text or a new text obtained from the outside. The alternative text is text for calculating the similarity with the target text. The text may be short text or long text. Short text refers to short length text such as a sentence or a few short sentences (small paragraphs).
In particular, the content input by the user can be received as the target text. Meanwhile, one or more alternative texts are obtained from the available texts which can be collected. It should be understood that the text database may be constructed in advance according to the existing text that can be collected, and then the alternative text may be obtained from the text database.
S120, determining the word sense similarity and the word frequency-inverse word frequency similarity of the target text and the alternative text.
Here, the word sense similarity is a similarity determined from the semantic dimension of words, and the word frequency-inverse word frequency similarity is a similarity determined from the importance dimension of words. The word frequency measures whether two sentences contain the same words with similar frequencies; if so, the sentences are similar. The inverse word frequency measures the importance of each word to the sentence, mainly based on the idea that the more a word appears in this sentence while appearing less in other sentences, the more important the word is to this sentence. The word frequency and the inverse word frequency jointly characterize each word in the sentence, and the similarity of sentences is calculated from these characteristics.
Illustratively, the word frequency-inverse word frequency similarity includes a part-of-speech-free word frequency-inverse word frequency similarity and/or a part-of-speech-aware word frequency-inverse word frequency similarity. The former does not distinguish the part of speech of each word in the target text, while the latter does. Because the part of speech can reflect the semantics of a word to a certain extent, the part-of-speech-aware word frequency-inverse word frequency similarity can be understood as a similarity determined from two dimensions: the semantics of the word and the importance of the word.
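One common way to feed a part-of-speech-aware variant into an otherwise part-of-speech-free TF-IDF pipeline (an assumption here, not something the text prescribes) is to fuse each token with its tag, so that the same surface word with different parts of speech becomes a distinct term. The tagged tokens below are hand-written stand-ins for the output of a real tagger such as jieba's posseg:

```python
def with_pos(tagged_tokens):
    """Fuse each (word, part-of-speech) pair into a single term, so a
    downstream TF-IDF model distinguishes homographs by part of speech."""
    return [f"{word}/{tag}" for word, tag in tagged_tokens]

# Hypothetical tagged segmentation of "订单 取消 了" ("the order was cancelled")
tagged = [("订单", "n"), ("取消", "v"), ("了", "u")]
print(with_pos(tagged))   # ['订单/n', '取消/v', '了/u']
```

Feeding such fused terms into the same vectorizer yields the "part-of-speech-aware" feature space, while the raw tokens yield the "part-of-speech-free" one.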
In the related art, similarity can be determined from only one dimension of the word, and when calculating the similarity of texts with more words and richer sentence semantics, such a similarity determination method cannot determine text similarity accurately. Therefore, in the embodiments of the present invention, similarities in at least two dimensions are used to determine text similarity: specifically, the word sense dimension and the word importance dimension, where the word importance dimension can be further divided into a part-of-speech-free and a part-of-speech-aware importance dimension. Although text similarity is still measured at word granularity, fusing similarities from multiple dimensions represents the similarity of texts more completely.
In specific implementation, the target text and the alternative text can be used as input of a word meaning similarity model for word meaning similarity calculation, and the word meaning similarity between the target text and the alternative text can be obtained through the similarity calculation of the word meaning similarity model. Similarly, the target text and the alternative text can be used as input of a word frequency inverse word frequency similarity model for calculating the word frequency inverse word frequency similarity, and the word frequency inverse word frequency similarity between the target text and the alternative text can be obtained through the similarity calculation of the word frequency inverse word frequency similarity model. The Word sense similarity model may be a similarity model based on NLP, and may be a Word vector learning model such as a Word2vec model, a Glove model, or a Bert model. The word frequency inverse word frequency similarity model needs to calculate the word frequency inverse word frequency similarity without word property and the word frequency inverse word frequency similarity with word property, so the word frequency inverse word frequency similarity model can be divided into a word frequency inverse word frequency similarity model without word property and a word frequency inverse word frequency similarity model with word property, the two word frequency inverse word frequency similarity models can be the same model, such as a TF-IDF model, and only the input data is different, such as one input does not distinguish word groups with word property, and the other input distinguishes word groups with word property; or two different models can be adopted, and the model containing the word frequency and the inverse word frequency similarity can be required to perform specific processing aiming at the word property.
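For the word sense branch, a common baseline (sketched below with hand-made toy vectors standing in for a trained Word2vec, Glove or Bert model; the values are illustrative, not from any real model) is to average the word vectors of a sentence and compare sentences by cosine similarity. The Background section notes the limits of such averaging for complex syntax, so this is only the simplest instance of the word sense similarity model.

```python
import math

# Toy word vectors standing in for a trained word-embedding model.
EMB = {
    "phone":   [0.9, 0.1, 0.0],
    "mobile":  [0.8, 0.2, 0.1],
    "battery": [0.1, 0.9, 0.2],
    "charge":  [0.2, 0.8, 0.3],
    "red":     [0.0, 0.1, 0.9],
}

def sentence_vector(tokens, dim=3):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [EMB[t] for t in tokens if t in EMB]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

With these toy vectors, "phone battery" and "mobile charge" come out far more similar than "phone" and "red", matching the intuition that word sense similarity captures meaning rather than surface overlap.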
In the above process, the similarity calculation is performed on a single candidate text, and if the number of candidate texts is multiple, the similarity calculation between all candidate texts and the target text can be completed only by performing the loop operation of multiple above processes. In order to increase the calculation speed and thus increase the determination efficiency of candidate texts that are more similar to the target text in the multiple candidate texts, all the candidate texts may be calculated in parallel, for example, each candidate text is represented by a vector with different dimensions in advance, and then all the candidate texts are represented in different matrix forms. In specific implementation, the process can refer to fig. 3b, and all the candidate texts are characterized in advance as word sense feature matrices and word frequency inverse word frequency feature matrices (including word frequency inverse word frequency feature matrices without parts of speech and/or word frequency inverse word frequency feature matrices with parts of speech), which is called stock calculation. After the target text is determined, the target text is characterized into word sense characteristic vectors and word frequency inverse word frequency characteristic vectors (including word frequency inverse word frequency characteristic vectors without word parts and/or word frequency inverse word frequency characteristic vectors with word parts) based on the same model of stock calculation, and the process is called real-time incremental calculation. 
And finally, performing word sense similarity calculation by using the word sense characteristic vector and the word sense characteristic matrix, and performing word frequency inverse word frequency similarity calculation by using the word frequency inverse word frequency characteristic vector and the word frequency inverse word frequency characteristic matrix to determine the word sense similarity and the word frequency inverse word frequency similarity between the target text and each candidate text.
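The stock/incremental split described above can be sketched as follows: the candidate texts are pre-characterized as a row-normalized feature matrix (stock calculation), the newly arrived target text is characterized as one feature vector with the same model (real-time incremental calculation), and a single matrix-vector product then scores every candidate at once instead of looping. The dimensions and random features below are illustrative placeholders for real model output.

```python
import numpy as np

# Stock calculation: candidate texts pre-characterized as a feature matrix,
# rows L2-normalized so a dot product equals cosine similarity.
rng = np.random.default_rng(0)
candidates = rng.random((1000, 64))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

# Real-time incremental calculation: the target text arrives and is
# characterized as one vector by the same model, then normalized.
target = rng.random(64)
target /= np.linalg.norm(target)

# One matrix-vector product yields the similarity to all candidates.
sims = candidates @ target           # shape (1000,)
top5 = np.argsort(sims)[::-1][:5]    # indices of the 5 most similar candidates
```

The same pattern applies to each feature space (word sense, part-of-speech-free and part-of-speech-aware TF-IDF); only the matrices differ.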
It should be noted that the part-of-speech-free and the part-of-speech-aware word frequency-inverse word frequency similarities are not necessarily both calculated; only one of them may be selected. That is, one or both of the part-of-speech-free and the part-of-speech-aware word frequency-inverse word frequency similarity models may be used in this operation. Accordingly, the similarities determined in this operation may be: word sense similarity and part-of-speech-free word frequency-inverse word frequency similarity; word sense similarity and part-of-speech-aware word frequency-inverse word frequency similarity; or word sense similarity, part-of-speech-free word frequency-inverse word frequency similarity and part-of-speech-aware word frequency-inverse word frequency similarity.
S130, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word semantic similarity and the word frequency inverse word frequency similarity.
The preset word semantic weight and the preset word frequency inverse word frequency weight are preset weight values and are respectively used for determining the proportion of the word semantic similarity and the word frequency inverse word frequency similarity in the similarity fusion process, and the preset word semantic weight and the preset word frequency inverse word frequency weight can be set manually in advance. The text similarity refers to the similarity obtained after the similarities of different dimensions are fused, and can represent the comprehensive similarity of the target text and the alternative text.
Illustratively, the preset word frequency inverse word frequency weight comprises a preset word frequency inverse word frequency weight without a part of speech and/or a preset word frequency inverse word frequency weight with a part of speech. It should be noted that the preset word frequency inverse word frequency weight is correspondingly consistent with the word frequency inverse word frequency similarity. If the word frequency inverse word frequency similarity is the word frequency inverse word frequency similarity without the part of speech, the preset word frequency inverse word frequency weight is the preset word frequency inverse word frequency weight without the part of speech; if the word frequency inverse word frequency similarity is the word frequency inverse word frequency similarity containing the part of speech, the preset word frequency inverse word frequency weight is the preset word frequency inverse word frequency weight containing the part of speech; if the word frequency inverse word frequency similarity is the word frequency inverse word frequency similarity without the part of speech and the word frequency inverse word frequency similarity containing the part of speech, the preset word frequency inverse word frequency weight is the preset word frequency inverse word frequency weight without the part of speech and the preset word frequency inverse word frequency weight containing the part of speech.
After obtaining the similarity between the target text and the candidate text in different dimensions, the similarity of each dimension needs to be fused so as to obtain the multi-dimensional text similarity. In specific implementation, according to the weight of the similarity under each dimension, the obtained similarities of different dimensions are subjected to weighted summation to determine the text similarity, and a weighted summation formula for determining the text similarity is as shown in formula (1):
score = w1 · score1 + w2 · score2    (1)
where score represents the text similarity between the target text and the alternative text, w1 and w2 respectively represent the preset word sense weight and the preset word frequency-inverse word frequency weight, and score1 and score2 respectively represent the word sense similarity and the word frequency-inverse word frequency similarity.
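The weighted-summation fusion can be sketched directly; it generalizes to any number of similarity dimensions. The weight values below are illustrative placeholders, since the text says the weights are preset manually:

```python
def fuse(sim_scores, weights):
    """Weighted sum of per-dimension similarities: the fusion step that
    combines word sense and word frequency-inverse word frequency scores."""
    assert set(sim_scores) == set(weights), "one weight per dimension"
    return sum(weights[k] * sim_scores[k] for k in sim_scores)

# e.g. word sense similarity 0.90 and part-of-speech-free TF-IDF
# similarity 0.70, fused with illustrative weights w1=0.6, w2=0.4.
score = fuse({"sense": 0.90, "tfidf": 0.70},
             {"sense": 0.60, "tfidf": 0.40})
```

With three dimensions in play, a third key (e.g. a part-of-speech-aware score) and weight are simply added to both dictionaries.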
According to the technical scheme of this embodiment, the word sense similarity between the target text and the candidate text is generated, together with the word frequency-inverse word frequency similarity comprising a part-of-speech-free and/or a part-of-speech-aware word frequency-inverse word frequency similarity; the text similarity between the target text and the candidate text is then determined according to the preset word sense weight, the preset word frequency-inverse word frequency weight (comprising a preset part-of-speech-free weight and/or a preset part-of-speech-aware weight), the word sense similarity and the word frequency-inverse word frequency similarity. By using multiple dimensional characteristics of word sense, word frequency and/or part of speech, the whole-sentence semantics of a text can be grasped, so that text similarity is represented from different dimensions, a comprehensive similarity of the texts is obtained, and the accuracy of the text similarity is improved to a great extent.
Example two
In this embodiment, based on the first embodiment, further optimization is performed on "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the candidate text", and optimization is performed on "determining the text similarity of the target text and the candidate text according to the preset word semantic weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 4a, the text similarity determining method provided in this embodiment includes:
and S210, acquiring a target text and an alternative text with similarity to be determined.
S220, performing word segmentation on the target text to obtain each target non-part-of-speech word corresponding to the target text.
Since the target text contains at least one sentence and the similarity is determined on the granularity of words, the word segmentation processing needs to be performed on the target text first so as to split the target text into a plurality of words. Because each word after word segmentation has no part of speech, each word obtained by word segmentation is each target part of speech-free word corresponding to the target text.
And S230, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
According to the description of the above embodiment, the word sense similarity model may be a Word2vec model. The model uses a shallow two-layer neural network either to predict a target word from its context as input (corresponding to the CBOW model structure) or to predict the context of a word from the word itself as input (corresponding to the Skip-Gram model structure); training on a text corpus yields the parameters of the network's hidden layer and thus the trained Word2vec model. The trained Word2vec model can map each word into a vector space, characterizing words as corresponding feature vectors.
Referring to FIG. 4b, the CBOW model has three layers: an input layer, a hidden layer and an output layer. It predicts $P(w_t \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k})$, where $w_t$ is the target word to be predicted and $w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}$ (here $k = 2$) is its context, i.e. the two words before and the two words after the target word are selected as the context. The operation from the input layer to the hidden layer is the summation of the context vectors, and from the hidden layer to the output layer hierarchical Softmax or negative sampling (Negative Sampling) is adopted. Referring to FIG. 4c, the Skip-Gram model also has three layers, an input layer, a hidden layer and an output layer, but in contrast to the CBOW model, the Skip-Gram model predicts $P(w_i \mid w_t)$, where $t - c \le i \le t + c$ and $i \ne t$, and $c$ is the window size (a constant representing the extent of the context). Given a training sequence $w_1, w_2, w_3, \ldots, w_T$, the objective of Skip-Gram is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where the probability $p(w_O \mid w_I)$ is defined by the Softmax over the vocabulary:

$$p(w_O \mid w_I) = \frac{\exp\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{W} \exp\big({v'_w}^{\top} v_{w_I}\big)}$$

with $v_w$ and $v'_w$ the input and output vector representations of word $w$, and $W$ the vocabulary size.
After the Word2vec model structure is determined, the model needs to be trained; the training process can be seen in the stock calculation flow of fig. 3b.
Illustratively, the Word2vec model is trained in advance based on a short text database and a long text database, wherein the long text database is constructed in advance based on service data corresponding to a service scene, and the short text database is constructed in advance based on service data corresponding to a service requirement under the service scene.
The short text database is a database composed of a large number of short texts, and the collection source of the short text data is related to the specific service requirement. For example, if the service requirement is merely to calculate similarity, the data source of the short text database can be short texts from any network platform, and the more sources the better. If the service requirement is to determine a reference answer according to similar questions, the data source of the short text database can only be the question texts in the intelligent question-answering system corresponding to the target text, because the emphasis of the answers differs between intelligent question-answering systems. The long text database is composed of texts with many sentences, such as articles or product specifications, and its collection source is likewise related to the specific business requirement. For example, if the service scenario is intelligent question answering, the long text database can be constructed by collecting long texts with strong sentence-level logic, such as expert recommendation articles, product introductions or product specifications, from any intelligent question-answering system.
Since the target text may be either a short text or a long text, in order to enhance the text compatibility of the Word2vec model, this embodiment uses the short text database and the long text database together for model training; and in order to improve the semantic expressiveness of the Word2vec model, the more complete the vocabulary in the long text database, the better the training effect. In a specific implementation, the training data is acquired first, i.e. the long text database and the short text database are obtained according to the service scenario and the service requirement. The data is then preprocessed, e.g. by word segmentation and data cleansing, which removes stop words and punctuation marks and retains only valid data. Finally, the preprocessed long text database and short text database are input into a Word2vec model for model training, yielding the trained Word2vec model.
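As a minimal sketch of the preprocessing step described above (tokenization plus stop-word and punctuation removal); the regex tokenizer, stop-word list and sample texts are hypothetical stand-ins, and a production system for Chinese text would use a proper word segmenter:

```python
import re

# Hypothetical stop-word list; a real system would load a full one.
STOP_WORDS = {"the", "a", "of", "is", "to"}

def preprocess(text):
    """Split a text into lowercase word tokens, dropping
    punctuation and stop words (the 'data cleansing' step)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

corpus = [
    "The quality of the product is great.",
    "How to return a product?",
]
segmented = [preprocess(doc) for doc in corpus]
```

The resulting lists of tokens are what would be fed to Word2vec training.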
In the increment calculation part, each target non-part-of-speech Word is input into the trained Word sense similarity model (the Word2vec model) to obtain the row vector representation of each target non-part-of-speech Word. The column-wise mean of these row vectors is then computed, and the resulting row vector of column means is taken as the word sense feature vector of the target text. The word sense feature vector corresponding to the alternative text is obtained in the same way. Finally, the vector cosine of the word sense feature vector of the target text and the word sense feature vector of the alternative text is calculated, yielding the word sense similarity of the target text and the alternative text.
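The increment calculation described above (look up each word's vector, average the rows column-wise, then compare by vector cosine) can be sketched as follows; the three-dimensional toy embedding table stands in for a trained Word2vec model:

```python
import math

# Toy stand-in for a trained Word2vec table: word -> vector.
# Real vectors would come from the trained model.
EMB = {
    "return":  [0.9, 0.1, 0.0],
    "product": [0.2, 0.8, 0.1],
    "refund":  [0.8, 0.2, 0.1],
}

def sense_vector(words):
    """Word sense feature vector: column-wise mean of word vectors."""
    dim = len(next(iter(EMB.values())))
    vecs = [EMB[w] for w in words if w in EMB]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

target = sense_vector(["return", "product"])     # target text
candidate = sense_vector(["refund", "product"])  # alternative text
sim = cosine(target, candidate)                  # word sense similarity
```

Here the two texts share one word and have near-synonymous vectors for the other, so the cosine comes out close to 1.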
Exemplarily, S230 includes: inputting each target non-part-of-speech word into the word sense similarity model to generate the word sense feature vector corresponding to the target text; determining the word sense similarity of the target text and the alternative text corresponding to each row vector according to the word sense feature vector and the row vectors in the word sense feature matrix; wherein the word sense feature matrix is generated from the non-part-of-speech word segmentation results of the text database and the word sense similarity model.
When every text in the text database is an alternative text, in order to improve operational efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These are then input into the Word2vec model to obtain the word sense feature matrix of the text database, in which each row vector represents the feature vector of one alternative text. Finally, the vector cosine between the word sense feature vector of the target text and each row vector in the word sense feature matrix is calculated, yielding the word sense similarity between the target text and each alternative text in the text database.
S240, determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, the target text and the alternative text.
The word frequency inverse word frequency similarity model in this embodiment adopts the word frequency inverse word frequency model, the TF-IDF model, which is constructed from the term frequency (TF) value and the inverse document frequency (IDF) value of a word. The Term Frequency (TF) is the frequency of occurrence of a given word in a text; the TF value normalizes the raw count to prevent a bias toward longer texts. The Inverse Document Frequency (IDF) is a measure of the general importance of a word. The specific calculation formulas are as follows:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

$$idf_i = \log \frac{|D|}{|\{\,j : t_i \in d_j\,\}|}$$

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
where $n_{i,j}$ denotes the number of occurrences of word $t_i$ in text $d_j$, $\sum_k n_{k,j}$ denotes the total number of words in text $d_j$, $|D|$ is the total number of texts in the text database, and $|\{j : t_i \in d_j\}|$ denotes the number of texts in the text database that contain the word $t_i$. To ensure that the denominator is not 0, $|\{j : t_i \in d_j\}| + 1$ is typically used instead.
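The formulas above can be transcribed directly; this sketch computes TF-IDF scores for every text in a toy corpus, including the +1 denominator smoothing mentioned in the text (function and variable names are illustrative):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of tokenized texts. Returns one {word: tf-idf}
    dict per text, following tf = n_ij / sum_k(n_kj) and
    idf = log(|D| / (|{j : t_i in d_j}| + 1))."""
    D = len(corpus)                       # |D|
    df = Counter()                        # |{j : t_i in d_j}|
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)             # n_ij
        total = sum(counts.values())      # sum_k n_kj
        scores.append({w: (c / total) * math.log(D / (df[w] + 1))
                       for w, c in counts.items()})
    return scores

scores = tf_idf([["a", "b", "a"], ["b", "c"]])
```

Note that with the +1 smoothing applied only to the denominator, the IDF of a word occurring in every text becomes negative; practical implementations often smooth the numerator as well, which is a design choice the patent text leaves open.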
According to the calculation formula of the TF-IDF value, the word frequency inverse word frequency model needs to be trained in advance on the text database to determine $|D|$. Meanwhile, the vector dimension of a word's vector representation is determined by the number of distinct words in the text database. After the trained TF-IDF model is obtained, each target non-part-of-speech word, the target text and the text database are input into the TF-IDF model to obtain the TF-IDF value of each target non-part-of-speech word, which together form the part-of-speech-free word frequency inverse word frequency feature vector of the target text; the number of columns of this vector equals the determined vector dimension, the element positions corresponding to the target non-part-of-speech words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, the part-of-speech-free word frequency inverse word frequency feature vector corresponding to the alternative text can be obtained. Finally, the vector cosine of the part-of-speech-free feature vector of the target text and that of the alternative text is calculated, yielding the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S240 includes: inputting each target word without word property and target text into a word frequency inverse word frequency similarity model, and generating a word frequency inverse word frequency eigenvector without word property corresponding to the target text; determining the similarity of the word frequency inverse word frequency without the part of speech of the target text and the alternative text corresponding to the row vector according to the word frequency inverse word frequency without the part of speech eigenvector and the row vector in the word frequency inverse word frequency without the part of speech eigenvector characteristic matrix; and generating the characteristic matrix of the word frequency inverse word frequency without the part of speech according to the word segmentation result without the part of speech of the text database and the similarity model of the word frequency inverse word frequency.
When every text in the text database is an alternative text, in order to improve operational efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation and data cleaning to obtain the corresponding non-part-of-speech word segmentation results. These are then input into the TF-IDF model to obtain the part-of-speech-free word frequency inverse word frequency feature matrix of the text database; its number of columns equals the vector dimension, element positions with no corresponding word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosine between the part-of-speech-free feature vector of the target text and each row vector of the matrix is calculated, yielding the part-of-speech-free word frequency inverse word frequency similarity between the target text and each alternative text.
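The vector layout described above (TF-IDF values at the positions of words present in the text, 0 elsewhere, then cosine against each matrix row) can be sketched as follows; the vocabulary and TF-IDF scores are hypothetical stand-ins for the trained model's output:

```python
import math

def vectorize(tfidf_scores, vocab):
    """Fixed-length vector over the database vocabulary; positions
    of words absent from the text are filled with 0."""
    return [tfidf_scores.get(w, 0.0) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Hypothetical vocabulary and per-text TF-IDF scores (these would
# come from the trained TF-IDF model over the text database).
vocab = ["refund", "product", "shipping"]
target_vec = vectorize({"refund": 0.4, "product": 0.1}, vocab)
matrix = [
    vectorize({"refund": 0.5, "product": 0.2}, vocab),  # alternative 0
    vectorize({"shipping": 0.6}, vocab),                # alternative 1
]
sims = [cosine(target_vec, row) for row in matrix]
```

The first alternative shares its vocabulary with the target and scores high; the second shares nothing, so its cosine is 0.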
And S250, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the word sense similarity and the word frequency inverse similarity without word property.
Taking the preset word frequency inverse word frequency weight in formula (1) to be the preset part-of-speech-free word frequency inverse word frequency weight $w_{21}$, and the word frequency inverse word frequency similarity to be the part-of-speech-free word frequency inverse word frequency similarity $score_{21}$, the text similarity between the target text and the alternative text can be obtained according to the following formula:

$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} \tag{7}$$
According to the technical scheme of this embodiment, the word sense similarity and the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-free word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and part-of-speech-free word importance, which improves the determination accuracy of the text similarity to a certain extent.
EXAMPLE III
In this embodiment, on the basis of the first embodiment, "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the alternative text" is further refined, as is "determining the text similarity of the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Explanations of terms that are the same as or correspond to those of the above embodiments are not repeated here. Referring to fig. 5, the text similarity determining method provided in this embodiment includes:
S310, obtaining a target text and an alternative text with similarity to be determined.
And S320, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
In this embodiment, the part of speech of each word needs to be distinguished, so after the target text is segmented, part-of-speech tagging is performed on each obtained target non-part-of-speech word to obtain the target part-of-speech tagged words. If a target non-part-of-speech word has two or more parts of speech, it generates two or more target part-of-speech tagged words.
S330, determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text.
S340, determining word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
In this embodiment, the model for determining the part-of-speech-containing word frequency inverse word frequency similarity also adopts the word frequency inverse word frequency TF-IDF model; only the model parameters $\sum_k n_{k,j}$ and $|\{j : t_i \in d_j\}|$ in the TF-IDF model will increase due to the introduction of parts of speech. The process of determining the part-of-speech-containing word frequency inverse word frequency similarity is then as follows: each target part-of-speech tagged word, the target text and the text database are input into the TF-IDF model to obtain the TF-IDF value of each target part-of-speech tagged word, which together form the part-of-speech-containing word frequency inverse word frequency feature vector of the target text; the number of columns of this vector equals the determined vector dimension, the element positions corresponding to the target part-of-speech tagged words are filled with their TF-IDF values, and the remaining element positions are filled with 0. Similarly, word segmentation and part-of-speech tagging can be performed on the alternative text to obtain the part-of-speech-containing word frequency inverse word frequency feature vector corresponding to the alternative text. Finally, the vector cosine of the part-of-speech-containing feature vector of the target text and that of the alternative text is calculated, yielding the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text.
Exemplarily, S340 includes: inputting each target part-of-speech tagging word and the target text into a word frequency inverse word frequency similarity model, and generating a word frequency inverse word frequency characteristic vector containing the part-of-speech corresponding to the target text; determining the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text corresponding to the row vector according to the part-of-speech-containing word frequency inverse word frequency feature vector and the row vector in the part-of-speech-containing word frequency inverse word frequency feature matrix; and generating a word frequency-inverse word frequency feature matrix containing parts of speech according to the part of speech tagging word segmentation result of the text database and the word frequency-inverse word frequency similarity model.
When every text in the text database is an alternative text, in order to improve operational efficiency, each alternative text in the text database may be preprocessed in advance by word segmentation, part-of-speech tagging and data cleaning to obtain the corresponding part-of-speech-tagged word segmentation results. These are then input into the TF-IDF model to obtain the part-of-speech-containing word frequency inverse word frequency feature matrix of the text database; its number of columns equals the number of distinct part-of-speech tagged words (word plus part of speech, without repetition) in the text database, element positions with no corresponding part-of-speech tagged word are filled with 0, and each row vector in the matrix represents the feature vector of one alternative text. Finally, the vector cosine between the part-of-speech-containing feature vector of the target text and each row vector of the matrix is calculated, yielding the part-of-speech-containing word frequency inverse word frequency similarity between the target text and each alternative text.
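A tiny illustration of why part-of-speech tagging changes the TF-IDF dimensionality: tagged tokens (here in a hypothetical "word/pos" form) keep a noun use and a verb use of the same surface word as distinct vocabulary entries, so the vocabulary, and hence the feature-matrix column count, grows:

```python
# "book a book": the verb "book" and the noun "book" are conflated
# without tags, but become separate entries once tagged.
doc_plain  = ["book", "a", "book"]
doc_tagged = ["book/v", "a/det", "book/n"]   # hypothetical tag set

vocab_plain  = sorted(set(doc_plain))        # 2 distinct words
vocab_tagged = sorted(set(doc_tagged))       # 3 distinct tagged words
```

Each distinct tagged word gets its own column in the part-of-speech-containing feature matrix.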
S350, determining the text similarity between the target text and the alternative text according to the preset word semantic weight, the preset word frequency inverse weight containing the part of speech, the word semantic similarity and the word frequency inverse similarity containing the part of speech.
Taking the preset word frequency inverse word frequency weight in formula (1) to be the preset part-of-speech-containing word frequency inverse word frequency weight $w_{22}$, and the word frequency inverse word frequency similarity to be the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$, the text similarity between the target text and the alternative text can be obtained according to the following formula:

$$score = w_1 \cdot score_1 + w_{22} \cdot score_{22} \tag{8}$$
According to the technical scheme of this embodiment, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between the target text and the alternative text is then determined according to the preset word sense weight, the preset part-of-speech-containing word frequency inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency inverse word frequency similarity. The text similarity of the target text is thus determined from the two dimensions of word sense and part-of-speech-aware word importance, which improves the determination accuracy of the text similarity to a certain extent.
Example four
In this embodiment, on the basis of the first embodiment, "determining the word sense similarity and the word frequency inverse word frequency similarity of the target text and the alternative text" is further refined, as is "determining the text similarity of the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse word frequency weight, the word sense similarity and the word frequency inverse word frequency similarity". Explanations of terms that are the same as or correspond to those of the above embodiments are not repeated here. Referring to fig. 6a, the text similarity determining method provided in this embodiment includes:
and S410, acquiring a target text and an alternative text with similarity to be determined.
And S420, performing word segmentation and part-of-speech tagging on the target text to obtain target non-part-of-speech words and target part-of-speech tagged words corresponding to the target text.
Referring to fig. 6b, the target text is subjected to word segmentation processing to obtain each target non-part-of-speech word, and is subjected to word segmentation and part-of-speech tagging processing to obtain each target part-of-speech tagged word.
S430, determining word sense similarity of the target text and the alternative text based on a pre-trained word sense similarity model according to each target non-part-of-speech word and the alternative text.
The word sense feature vector of the target text is obtained from each target non-part-of-speech word and the trained Word2vec model, and word sense similarity is computed between it and each row vector of the word sense feature matrix corresponding to the text database, yielding the word sense similarity $score_1$ between the target text and an alternative text in the text database. Since there are n alternative texts in the text database, n values of $score_1$ are obtained.
S440, determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target non-part-of-speech word, the target text and the alternative text.
The part-of-speech-free word frequency inverse word frequency feature vector of the target text is obtained from each target non-part-of-speech word and the trained TF-IDF model, and the part-of-speech-free word frequency inverse word frequency similarity is computed between it and each row vector of the part-of-speech-free word frequency inverse word frequency feature matrix corresponding to the text database, yielding the part-of-speech-free word frequency inverse word frequency similarity $score_{21}$ between the target text and an alternative text in the text database. Since there are n alternative texts in the text database, n values of $score_{21}$ are obtained.
S450, determining the word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part of speech tagging word, the target text and the alternative text.
The part-of-speech-containing word frequency inverse word frequency feature vector of the target text is obtained from each target part-of-speech tagged word and the trained TF-IDF model, and the part-of-speech-containing word frequency inverse word frequency similarity is computed between it and each row vector of the part-of-speech-containing word frequency inverse word frequency feature matrix corresponding to the text database, yielding the part-of-speech-containing word frequency inverse word frequency similarity $score_{22}$ between the target text and an alternative text in the text database. Since there are n alternative texts in the text database, n values of $score_{22}$ are obtained.
And S460, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the preset word frequency inverse weight with word property, the word sense similarity, the word frequency inverse similarity without word property and the word frequency inverse similarity with word property.
Taking the preset word frequency inverse word frequency weight in formula (1) to be the preset part-of-speech-free weight $w_{21}$ together with the preset part-of-speech-containing weight $w_{22}$, and the word frequency inverse word frequency similarity to be the part-of-speech-free similarity $score_{21}$ together with the part-of-speech-containing similarity $score_{22}$, the text similarity between the target text and the alternative text can be obtained according to the following formula:

$$score = w_1 \cdot score_1 + w_{21} \cdot score_{21} + w_{22} \cdot score_{22} \tag{9}$$
there are n candidate texts in the text database, so that n scores can be obtained through formula (9), and each score represents the text similarity between the target text and the corresponding candidate text.
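Formula (9) is a plain weighted sum; a sketch with hypothetical weights and similarity scores (in practice the weights would be preset or tuned for the service):

```python
# Fusing the three similarity dimensions per formula (9).
# Weights are hypothetical stand-ins for the preset values.
w1, w21, w22 = 0.5, 0.25, 0.25

def fuse(score1, score21, score22):
    """score = w1*score1 + w21*score21 + w22*score22  (formula 9)"""
    return w1 * score1 + w21 * score21 + w22 * score22

# One fused score per alternative text (n = 3 here).
sense       = [0.95, 0.40, 0.10]  # score_1 per alternative
tf_pos_free = [0.90, 0.30, 0.20]  # score_21 per alternative
tf_pos      = [0.88, 0.35, 0.15]  # score_22 per alternative
scores = [fuse(a, b, c) for a, b, c in zip(sense, tf_pos_free, tf_pos)]
```

Each entry of `scores` is the text similarity between the target text and the corresponding alternative text.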
And S470, performing descending order arrangement on the alternative texts according to the similarity of the texts, and generating an ordering result.
After determining the similarity of each text, a plurality of candidate texts with the similarity satisfying the service requirement can be determined from all the candidate texts. At this time, in order to further improve the service operation efficiency, all the candidate texts may be sorted in a descending order according to the text similarity, and a sorting result is generated.
And S480, extracting a preset number of alternative texts from the sequencing result to be used as similar texts of the target text.
The number of alternative texts to be selected, i.e. the preset number, is determined according to the service requirement. Then the preset number of top-ranked alternative texts are extracted from the sorting result as the similar texts of the target text.
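Steps S470 and S480 amount to a descending sort by fused score followed by taking the first k indices; a minimal sketch with hypothetical scores:

```python
# Descending sort by fused similarity, then take the top-k
# alternatives (k is the "preset number" from the text).
scores = [0.92, 0.36, 0.14, 0.78]   # hypothetical fused scores
k = 2
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top_k = ranked[:k]                  # indices of the similar texts
```

The indices in `top_k` identify the alternative texts whose answers would be looked up in S490.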
And S490, when the service scene is an intelligent question-answering scene and the service requirement is to determine an alternative answer of the target text, extracting the answer corresponding to each similar text from the short text database to serve as the alternative answer of the target text.
In the intelligent question-answering service scenario, questions and answers are usually short texts, so the target text is a target short text. If the service requirement is simply to determine similar short texts, the process may end at S480. However, if the service requirement is to determine alternative answers for the target text, the alternative texts need to be short texts from the short text database. In this case, the short answer text corresponding to each similar text is extracted from the short text database as an alternative answer of the target text, so that an intelligent customer service can provide a more accurate answer to the user more quickly and conveniently, or the answer can be provided to a human customer service agent to assist in answering similar questions.
According to the technical scheme of this embodiment, the word sense similarity, the part-of-speech-free word frequency inverse word frequency similarity and the part-of-speech-containing word frequency inverse word frequency similarity of the target text and the alternative text are determined, and the text similarity between the target text and the alternative text is determined according to the preset word sense weight, the preset part-of-speech-free word frequency inverse word frequency weight, the preset part-of-speech-containing word frequency inverse word frequency weight and the three similarities. The text similarity of the target text is thus determined from the three dimensions of word sense, part-of-speech-free word importance and part-of-speech-aware word importance, which improves the determination accuracy of the text similarity to a greater extent. Sorting the alternative texts by text similarity and taking the answers of the top-ranked similar texts as alternative answers of the target text can improve the accuracy and efficiency of alternative-answer determination in the intelligent question-answering system.
EXAMPLE five
The present embodiment provides a text similarity determining apparatus, and referring to fig. 7, the apparatus specifically includes:
the target text determination module 710 is configured to obtain a target text and an alternative text with similarity to be determined;
a first similarity determining module 720, configured to determine word sense similarity and word frequency inverse word frequency similarity between the target text and the candidate text, where the word frequency inverse word frequency similarity includes word frequency inverse word frequency similarity without word property and/or word frequency inverse word frequency similarity with word property;
the second similarity determining module 730 is configured to determine the text similarity between the target text and the candidate text according to the preset word sense weight, the preset word frequency inverse word frequency weight, and the word sense similarity and the word frequency inverse word frequency similarity, where the preset word frequency inverse word frequency weight includes a preset word frequency inverse word frequency weight without word property and/or a preset word frequency inverse word frequency weight with word property.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation on the target text to obtain target non-part-of-speech words corresponding to the target text;
determining word meaning similarity of the target text and the alternative text based on a pre-trained word meaning similarity model according to each target non-part-of-speech word and the alternative text;
determining the part-of-speech-free word frequency inverse word frequency similarity of the target text and the alternative text based on the word frequency inverse word frequency similarity model according to each target part-of-speech-free word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
and determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency inverse weight without word property, the word sense similarity and the word frequency inverse similarity without word property.
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on the target text to obtain the part-of-speech-free target words and the part-of-speech-tagged target words corresponding to the target text;
determining the word sense similarity between the target text and the alternative text based on a pre-trained word sense similarity model, according to each part-of-speech-free target word and the alternative text;
determining the part-of-speech-containing word frequency-inverse word frequency similarity between the target text and the alternative text based on the word frequency-inverse word frequency similarity model, according to each part-of-speech-tagged target word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-containing word frequency-inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency-inverse word frequency similarity.
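The segmentation-plus-tagging step can be pictured with the toy tagger below. Everything here is illustrative: the whitespace split stands in for a real word segmenter, and POS_LEXICON is a hypothetical lexicon standing in for a real part-of-speech tagger.

```python
# Hypothetical POS lexicon standing in for a real part-of-speech tagger.
POS_LEXICON = {"buy": "v", "cheap": "adj", "phone": "n"}

def segment_and_tag(text):
    """Return (part-of-speech-free words, part-of-speech-tagged words).

    The tagged form appends the POS tag to each word, so that a word used as
    a noun and the same word used as a verb become distinct TF-IDF terms.
    """
    words = text.lower().split()  # whitespace split stands in for segmentation
    tagged = ["{}/{}".format(w, POS_LEXICON.get(w, "x")) for w in words]
    return words, tagged

plain, tagged = segment_and_tag("buy cheap phone")
```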
Optionally, the first similarity determining module 720 is specifically configured to:
performing word segmentation and part-of-speech tagging on the target text to obtain the part-of-speech-free target words and the part-of-speech-tagged target words corresponding to the target text;
determining the word sense similarity between the target text and the alternative text based on a pre-trained word sense similarity model, according to each part-of-speech-free target word and the alternative text;
determining the part-of-speech-free word frequency-inverse word frequency similarity between the target text and the alternative text based on the word frequency-inverse word frequency similarity model, according to each part-of-speech-free target word, the target text and the alternative text;
determining the part-of-speech-containing word frequency-inverse word frequency similarity between the target text and the alternative text based on the word frequency-inverse word frequency similarity model, according to each part-of-speech-tagged target word, the target text and the alternative text;
accordingly, the second similarity determination module 730 is specifically configured to:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency-inverse word frequency weight, the preset part-of-speech-containing word frequency-inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency-inverse word frequency similarity and the part-of-speech-containing word frequency-inverse word frequency similarity.
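The three-way fusion described above reduces to a weighted sum of the three similarity scores. The sketch below uses illustrative weight values; the patent only requires that preset weights exist, not these particular numbers.

```python
def combined_similarity(sense_sim, no_pos_tfidf_sim, pos_tfidf_sim,
                        w_sense=0.5, w_no_pos=0.25, w_pos=0.25):
    """Fuse word sense similarity with the two TF-IDF similarities.

    The default weights are hypothetical; in practice they would be tuned
    for the business scenario.
    """
    total = w_sense + w_no_pos + w_pos
    assert abs(total - 1.0) < 1e-9, "weights are expected to sum to 1"
    return (w_sense * sense_sim
            + w_no_pos * no_pos_tfidf_sim
            + w_pos * pos_tfidf_sim)

score = combined_similarity(0.9, 0.6, 0.7)
```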
Optionally, the word frequency-inverse word frequency similarity model is a term frequency-inverse document frequency (TF-IDF) model, and the word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained on a short text database and a long text database, where the long text database is pre-constructed from the business data corresponding to a business scenario, and the short text database is pre-constructed from the business data corresponding to the business requirements in that scenario.
Optionally, the apparatus further includes a similar text determination module, configured to:
when there are a plurality of alternative texts, after the text similarity between the target text and each alternative text is determined, sort the alternative texts in descending order of text similarity to generate a ranking result;
and extract a preset number of alternative texts from the ranking result as similar texts of the target text.
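The ranking step above is a plain descending sort plus truncation, e.g.:

```python
def top_k_similar(scored_candidates, k):
    """Sort (text, similarity) pairs by similarity, descending, and keep k."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]

scored = [("doc_a", 0.42), ("doc_b", 0.91), ("doc_c", 0.77)]
similar_texts = top_k_similar(scored, 2)
```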
Further, the apparatus also includes an alternative answer determining module, configured to:
when the business scenario is an intelligent question answering scenario and the business requirement is to determine alternative answers for the target text, the target text being a target short text and the alternative texts being short texts in the short text database, extract, after the preset number of alternative texts have been taken from the ranking result as similar texts of the target text, the answer corresponding to each similar text from the short text database as an alternative answer for the target text.
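In the question answering setting this amounts to a lookup from similar questions to their stored answers. The mini database below is hypothetical:

```python
# Hypothetical short text database: stored question -> stored answer.
QA_DATABASE = {
    "how do i reset my password": "Use the 'forgot password' link on the login page.",
    "how do i change my password": "Open account settings and choose 'change password'.",
}

def alternative_answers(similar_questions, qa_db):
    """Fetch the stored answer for each similar short text (question)."""
    return [qa_db[q] for q in similar_questions if q in qa_db]

answers = alternative_answers(["how do i reset my password"], QA_DATABASE)
```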
With the text similarity determining device of the fifth embodiment of the present invention, the overall sentence-level semantics of a text can be captured through the multiple dimensional features of word sense, word frequency and/or part of speech, so that the text similarity is characterized along different dimensions, a comprehensive text similarity is obtained, and the accuracy of the text similarity is improved to a great extent.
The text similarity determining device provided by the embodiments of the present invention can execute the text similarity determining method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are divided only according to functional logic; the division is not limited thereto as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
EXAMPLE SIX
Referring to fig. 8, the present embodiment provides an apparatus including: one or more processors 820; and a storage device 810 configured to store one or more programs which, when executed by the one or more processors 820, cause the one or more processors 820 to implement the text similarity determination method provided by the embodiments of the present invention, including:
acquiring a target text and an alternative text whose similarity is to be determined;
determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the alternative text, where the word frequency-inverse word frequency similarity includes part-of-speech-free word frequency-inverse word frequency similarity and/or part-of-speech-containing word frequency-inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, where the preset word frequency-inverse word frequency weight includes a preset part-of-speech-free word frequency-inverse word frequency weight and/or a preset part-of-speech-containing word frequency-inverse word frequency weight.
Of course, those skilled in the art will understand that the processor 820 may also implement the technical solution of the text similarity determination method provided in any embodiment of the present invention.
The apparatus shown in fig. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention. As shown in fig. 8, the apparatus includes a processor 820, a storage device 810, an input device 830 and an output device 840; the number of processors 820 in the apparatus may be one or more, and one processor 820 is taken as an example in fig. 8; the processor 820, the storage device 810, the input device 830 and the output device 840 of the apparatus may be connected by a bus or in another manner, and connection by the bus 850 is taken as an example in fig. 8.
The storage device 810, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, a target text acquisition module, a first similarity determination module, and a second similarity determination module in the text similarity determination device).
The storage device 810 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the storage device 810 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 810 may further include memory located remotely from the processor 820, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 840 may include a display device such as a display screen.
EXAMPLE SEVEN
The present embodiment provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a text similarity determination method, the method including:
acquiring a target text and an alternative text whose similarity is to be determined;
determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the alternative text, where the word frequency-inverse word frequency similarity includes part-of-speech-free word frequency-inverse word frequency similarity and/or part-of-speech-containing word frequency-inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, where the preset word frequency-inverse word frequency weight includes a preset part-of-speech-free word frequency-inverse word frequency weight and/or a preset part-of-speech-containing word frequency-inverse word frequency weight.
Of course, the storage medium provided by the embodiments of the present invention contains computer-executable instructions that are not limited to the method operations described above and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus the necessary general-purpose hardware, and certainly also by hardware alone, although the former is in many cases the preferred implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a device (which may be a personal computer, a server or a network device) to execute the text similarity determining method provided by the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A text similarity determination method, characterized by comprising:
acquiring a target text and an alternative text whose similarity is to be determined;
determining word sense similarity and word frequency-inverse word frequency similarity between the target text and the alternative text, wherein the word frequency-inverse word frequency similarity comprises part-of-speech-free word frequency-inverse word frequency similarity and/or part-of-speech-containing word frequency-inverse word frequency similarity;
determining the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset part-of-speech-free word frequency-inverse word frequency weight and/or a preset part-of-speech-containing word frequency-inverse word frequency weight.
2. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the alternative text comprises:
performing word segmentation on the target text to obtain part-of-speech-free target words corresponding to the target text;
determining the word sense similarity between the target text and the alternative text based on a pre-trained word sense similarity model, according to each part-of-speech-free target word and the alternative text;
determining the part-of-speech-free word frequency-inverse word frequency similarity between the target text and the alternative text based on a word frequency-inverse word frequency similarity model, according to each part-of-speech-free target word, the target text and the alternative text;
correspondingly, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency-inverse word frequency weight, the word sense similarity and the part-of-speech-free word frequency-inverse word frequency similarity.
3. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the alternative text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain part-of-speech-free target words and part-of-speech-tagged target words corresponding to the target text;
determining the word sense similarity between the target text and the alternative text based on a pre-trained word sense similarity model, according to each part-of-speech-free target word and the alternative text;
determining the part-of-speech-containing word frequency-inverse word frequency similarity between the target text and the alternative text based on a word frequency-inverse word frequency similarity model, according to each part-of-speech-tagged target word, the target text and the alternative text;
correspondingly, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-containing word frequency-inverse word frequency weight, the word sense similarity and the part-of-speech-containing word frequency-inverse word frequency similarity.
4. The method according to claim 1, wherein determining the word sense similarity and the word frequency-inverse word frequency similarity between the target text and the alternative text comprises:
performing word segmentation and part-of-speech tagging on the target text to obtain part-of-speech-free target words and part-of-speech-tagged target words corresponding to the target text;
determining the word sense similarity between the target text and the alternative text based on a pre-trained word sense similarity model, according to each part-of-speech-free target word and the alternative text;
determining the part-of-speech-free word frequency-inverse word frequency similarity between the target text and the alternative text based on a word frequency-inverse word frequency similarity model, according to each part-of-speech-free target word, the target text and the alternative text;
determining the part-of-speech-containing word frequency-inverse word frequency similarity between the target text and the alternative text based on the word frequency-inverse word frequency similarity model, according to each part-of-speech-tagged target word, the target text and the alternative text;
correspondingly, determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity comprises:
determining the text similarity between the target text and the alternative text according to the preset word sense weight, the preset part-of-speech-free word frequency-inverse word frequency weight, the preset part-of-speech-containing word frequency-inverse word frequency weight, the word sense similarity, the part-of-speech-free word frequency-inverse word frequency similarity and the part-of-speech-containing word frequency-inverse word frequency similarity.
5. The method according to any one of claims 2-4, wherein the word frequency-inverse word frequency similarity model is a term frequency-inverse document frequency (TF-IDF) model, and the word sense similarity model is a Word2vec model;
the Word2vec model is pre-trained on a short text database and a long text database, wherein the long text database is pre-constructed from business data corresponding to a business scenario, and the short text database is pre-constructed from business data corresponding to business requirements in the business scenario.
6. The method according to claim 1, wherein when there are a plurality of alternative texts, after determining the text similarity between the target text and each alternative text, the method further comprises:
sorting the alternative texts in descending order of text similarity to generate a ranking result;
extracting a preset number of the alternative texts from the ranking result as similar texts of the target text.
7. The method according to claim 6, wherein when the business scenario is an intelligent question answering scenario and the business requirement is to determine alternative answers for the target text, the target text is a target short text and the alternative texts are short texts in a short text database, and after extracting the preset number of the alternative texts from the ranking result as similar texts of the target text, the method further comprises:
extracting, from the short text database, the answer corresponding to each similar text as an alternative answer for the target text.
8. A text similarity determination apparatus, comprising:
a target text acquisition module, configured to acquire a target text and an alternative text whose similarity is to be determined;
a first similarity determining module, configured to determine word sense similarity and word frequency-inverse word frequency similarity between the target text and the alternative text, wherein the word frequency-inverse word frequency similarity comprises part-of-speech-free word frequency-inverse word frequency similarity and/or part-of-speech-containing word frequency-inverse word frequency similarity;
and a second similarity determining module, configured to determine the text similarity between the target text and the alternative text according to a preset word sense weight, a preset word frequency-inverse word frequency weight, the word sense similarity and the word frequency-inverse word frequency similarity, wherein the preset word frequency-inverse word frequency weight comprises a preset part-of-speech-free word frequency-inverse word frequency weight and/or a preset part-of-speech-containing word frequency-inverse word frequency weight.
9. An apparatus, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text similarity determination method according to any one of claims 1 to 7.
CN201910600981.6A 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium Pending CN112182145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145A (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910600981.6A CN112182145A (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112182145A true CN112182145A (en) 2021-01-05

Family

ID=73915404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910600981.6A Pending CN112182145A (en) 2019-07-04 2019-07-04 Text similarity determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182145A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113505196B (en) * 2021-06-30 2024-01-30 和美(深圳)信息技术股份有限公司 Text retrieval method and device based on parts of speech, electronic equipment and storage medium
CN113837594A (en) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 Quality evaluation method, system, device and medium for customer service in multiple scenes
CN115329742A (en) * 2022-10-13 2022-11-11 深圳市大数据研究院 Scientific research project output evaluation acceptance method and system based on text analysis
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116228249A (en) * 2023-05-08 2023-06-06 陕西拓方信息技术有限公司 Customer service system based on information technology


Similar Documents

Publication Publication Date Title
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
Mukhtar et al. Urdu sentiment analysis using supervised machine learning approach
CN110019732B (en) Intelligent question answering method and related device
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN113704451B (en) Power user appeal screening method and system, electronic device and storage medium
CN112182145A (en) Text similarity determination method, device, equipment and storage medium
CN109960756B (en) News event information induction method
CN109766431A (en) A kind of social networks short text recommended method based on meaning of a word topic model
CN106708929B (en) Video program searching method and device
CN108549723B (en) Text concept classification method and device and server
Yu et al. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
US20200192921A1 (en) Suggesting text in an electronic document
CN111080055A (en) Hotel scoring method, hotel recommendation method, electronic device and storage medium
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN112270188A (en) Questioning type analysis path recommendation method, system and storage medium
CN111353044A (en) Comment-based emotion analysis method and system
Mozafari et al. Emotion detection by using similarity techniques
CN114255096A (en) Data requirement matching method and device, electronic equipment and storage medium
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
CN106570196B (en) Video program searching method and device
CN115248839A (en) Knowledge system-based long text retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination