CN112069790A - Text similarity recognition method and device and electronic equipment - Google Patents

Text similarity recognition method and device and electronic equipment Download PDF

Info

Publication number
CN112069790A
Authority
CN
China
Prior art keywords
sentence
vector
sentences
similar
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910425307.9A
Other languages
Chinese (zh)
Inventor
陈克寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910425307.9A
Publication of CN112069790A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

Embodiments of the invention provide a text similarity recognition method and apparatus and an electronic device. The method comprises the following steps: obtaining sentence vectors corresponding to the sentences in a given sentence set; encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar; generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences; and determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures. The scheme of the embodiments can effectively improve the similarity between the identified similar sentences.

Description

Text similarity recognition method and device and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular, to a text similarity recognition method and apparatus, and an electronic device.
Background
Effective machine learning algorithms mostly rely on large amounts of high-quality labeled training data. Obtaining such high-quality label data has always been a high-labor-cost problem in machine learning. For massive text data, having business personnel label every item directly is extremely time- and labor-consuming.
In the prior art, similar text data are generally labeled in batches by analyzing the similarity between texts, thereby improving the efficiency of obtaining label data. The following method is mainly adopted for analyzing the similarity of massive texts:
first, for the sentence vector of each sentence in a given sentence set, convert the sentence vector into a fixed-dimension 0/1 digital semantic signature according to the sign (positive or negative) of each dimension value; then, based on an index over the semantic signatures, preliminarily group the sentences indexed by the same semantic signature (or semantic signature fragment) into candidate similar-sentence groups, and determine the similar sentences corresponding to each sentence within each candidate group.
In this process, the semantic signatures decompose the search for similar sentences over the whole sentence set into searches within each candidate similar-sentence group, which reduces the complexity of the similarity computation. However, this method cannot ensure that the semantic signatures obtained from actually similar sentences are themselves similar, which degrades the similarity of the finally determined similar sentences.
Disclosure of Invention
The invention provides a text similarity recognition method and apparatus and an electronic device, which can effectively improve the similarity between the recognized similar sentences.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, a text similarity recognition method is provided, including:
obtaining sentence vectors corresponding to sentences in a given sentence set;
encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
In a second aspect, an apparatus for recognizing text similarity is provided, including:
a sentence vector acquisition module for acquiring the sentence vectors corresponding to the sentences in a given sentence set;
a sentence vector encoding module for encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module for generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and a similar sentence determining module for determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
In a third aspect, a text vectorization processing method is provided, including:
carrying out word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
In a fourth aspect, there is provided a text vectorization processing apparatus including:
the word vector acquisition module is used for carrying out word vector conversion on the sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector acquisition module, configured to process the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
In a fifth aspect, an electronic device is provided, comprising:
a memory for storing a program;
and a processor, coupled to the memory, for executing the program, wherein the program, when running, performs the text similarity recognition method provided by the invention.
In a sixth aspect, an electronic device is provided, comprising:
a memory for storing a program;
and a processor, coupled to the memory, for executing the program, wherein the program, when running, performs the text vectorization processing method provided by the invention.
With the text similarity recognition method, apparatus and electronic device provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. When the inverted indexes from the semantic signatures to the corresponding sentences are constructed, the semantic signatures corresponding to semantically similar sentences are themselves similar, so the similarity between the determined similar sentences is effectively guaranteed when the similar sentences corresponding to the sentences in the sentence set are determined according to the inverted indexes of the semantic signatures.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that it can be implemented according to the content of this description, and to make the above and other objects, features, and advantages of the present application more readily understandable, a detailed description of the present application follows.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of a training model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a loss function model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a text similarity recognition method according to an embodiment of the present invention;
FIG. 4 is a flow diagram of a method of generating a sentence vector;
FIG. 5 is a first block diagram of a text similarity recognition apparatus according to an embodiment of the present invention;
FIG. 6 is a second block diagram of a text similarity recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a first schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 8 is a flowchart of a text vectorization processing method according to an embodiment of the present invention;
FIG. 9 is a first block diagram of a text vectorization processing apparatus according to an embodiment of the present invention;
FIG. 10 is a block diagram of a second exemplary embodiment of a text vectorization processing apparatus;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the invention address the following defect in the prior art: when sentence similarity is determined through semantic signatures, the existing hash functions cannot guarantee that the semantic signatures generated from semantically similar sentence vectors are themselves similar, so the similarity of the sentences deemed similar after inverted indexing over the semantic signatures is hard to guarantee. The core idea is that, in the process of forming semantic signatures from sentence vectors, the sentence vectors are secondarily encoded by a feature encoder to generate corresponding sentence feature vectors. The feature encoder is obtained by learning, so that sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding feature vector segment, while those generated from dissimilar sentence vectors are not. Thus, when the semantic signatures of the corresponding sentences are generated based on the signs of the dimension values in the sentence feature vectors, the semantic signatures corresponding to semantically similar sentence feature vectors are also similar on the corresponding vector segments. By constructing inverted indexes from the semantic signatures to the corresponding sentences, the sentences indexed by similar semantic signatures are themselves relatively similar, so the similarity between the determined similar sentences is better guaranteed when they are found through the inverted indexes of the semantic signatures.
When training the feature encoder (hereinafter denoted fc), it can be fused with the existing semantic signature algorithm into an integral hash function model for training. fc is trained so that sentence vectors, once encoded by fc into sentence feature vectors and passed through the existing hash function, yield semantic signatures with the following property: semantically similar sentence vectors produce similar semantic signatures.
The hash function model has the following structure:
hash(x) = sgn(fc(x))
where fc is a fully connected layer (the original gives its formula as an image; a standard fully connected layer computes fc(x) = W·x + b).
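For illustration, a minimal PyTorch sketch of this hash function model follows; the class name FeatureEncoder and the layer sizes are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """fc: a single fully connected layer mapping a sentence vector
    (dimension d_in) to a D-dimensional sentence feature vector."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

def semantic_hash(encoder: FeatureEncoder, sentence_vec: torch.Tensor) -> torch.Tensor:
    """hash(x) = sgn(fc(x)): binarize every dimension of the feature vector."""
    feat = encoder(sentence_vec)
    return (feat >= 0).int()  # 1 for non-negative dimensions, 0 otherwise
```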
Loss function: construct sentence triples (q1, q2, q3), where q1 and q2 are semantically similar and q1 and q3 are not. After the sentence triples are passed through a sentence encoder to obtain sentence vectors, the corresponding sentence feature vectors (x1, x2, x3) are computed by fc. Each xi is a D-dimensional vector, divided into B segments of D/B dimensions each, denoted xij (i = 1, 2, 3; j = 1, 2, …, B). The loss function is computed as follows:
Segment loss:
segment_loss(j) = -log( exp(x1j · x2j) / (exp(x1j · x2j) + exp(x1j · x3j)) )
Hash loss (the original gives this formula only as an image; read together with the segment loss above, it aggregates the per-segment losses):
hash_loss = Σ_(j=1..B) segment_loss(j)
The hash function model encodes the following two key pieces of information in its loss function:
after binarization by the hash function (i.e., once the 0/1 digital semantic signatures are formed), the semantic signature segments corresponding to similar sentences are similar, and those corresponding to dissimilar sentences are dissimilar;
after binarization by the hash function, the original distance information of the sentence vectors is preserved (i.e., two close values should not be binarized differently, one to 1 and the other to 0).
Referring to the loss function of the hash function model: the segment classification guarantees the first point, that the semantic signature segments of similar sentences are similar and those of dissimilar sentences are dissimilar; and because the segment classification uses dot products (·), the second point is also guaranteed, namely that the original distance information of the sentence vectors is still preserved after binarization by the hash function.
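The loss computation can be sketched as follows (a non-authoritative reading: the aggregate hash loss is assumed to sum the per-segment losses, and D is assumed divisible by B):

```python
import torch

def segment_losses(x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor,
                   B: int) -> torch.Tensor:
    """Per-segment triplet loss for feature vectors (x1, x2, x3) of a
    triple (q1, q2, q3): q1/q2 similar, q1/q3 dissimilar.
    Each D-dimensional vector is split into B segments of D/B dims."""
    D = x1.shape[0]
    seg = lambda x: x.view(B, D // B)         # (B, D/B), assumes B divides D
    s1, s2, s3 = seg(x1), seg(x2), seg(x3)
    pos = (s1 * s2).sum(dim=1)                # dot product per segment
    neg = (s1 * s3).sum(dim=1)
    # segment_loss(j) = -log(exp(pos_j) / (exp(pos_j) + exp(neg_j)));
    # a production version would use logsumexp for numerical stability
    return -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(neg)))

def hash_loss(x1, x2, x3, B: int) -> torch.Tensor:
    # assumption: the aggregate loss sums the B segment losses
    return segment_losses(x1, x2, x3, B).sum()
```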
In addition, the sentence encoder used to obtain the sentence vectors can itself be obtained by machine learning, so that semantically similar sentences yield relatively similar sentence vectors.
Fig. 1 shows the joint training model of the sentence encoder and the hash function model.
The learning process of the sentence encoder is as follows:
As shown in fig. 1, the data to be trained are massive text data from various industries; each industry has a plurality of application scenarios, and each scenario corresponds to a sentence set. The learning process of the sentence encoder uses labeled text data to learn vector representations of sentences, so that the sentence vectors corresponding to semantically similar sentences are also similar.
Data preprocessing: group by industry and convert each industry's text data into a set of sentence triples, e.g. (q1, q2, q3), where q1, q2 and q3 are sentences of the same industry, q1 and q2 are sentences of the same scenario, and q1 and q3 are sentences of different scenarios.
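A minimal sketch of this triple construction; the data layout (a dict mapping industry → scenario → sentences) and the sampling counts are illustrative assumptions:

```python
import random

def build_triples(industry_data: dict[str, dict[str, list[str]]],
                  n_per_industry: int = 1000):
    """Yield (q1, q2, q3): q1 and q2 from the same scenario of one
    industry, q3 from a different scenario of that industry."""
    for industry, scenes in industry_data.items():
        scene_ids = list(scenes)
        if len(scene_ids) < 2:
            continue  # need at least two scenarios to draw a negative
        for _ in range(n_per_industry):
            s_pos, s_neg = random.sample(scene_ids, 2)
            if len(scenes[s_pos]) < 2:
                continue
            q1, q2 = random.sample(scenes[s_pos], 2)
            q3 = random.choice(scenes[s_neg])
            yield q1, q2, q3
```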
Model structure: the model mainly comprises two parts: sentence vector extraction and sentence similarity loss function. Inputting sentences (each sentence comprises a plurality of words wi, i-1,2 …, n), and obtaining a word vector matrix through a two-way LSTM (Long Short-Term Memory network); and then an attention layer (algorithm) formed by the context vectors (ui, i ═ 1,2, …, n) independent from the corresponding industries (fields) is used for carrying out context-attention mechanism processing to obtain final vector representation, namely sentence vectors. The model that converts sentences into sentence vectors is called a sentence coder, and the function is denoted as f (x). The sentence similarity loss function is calculated in a mode of classifying according to industries.
Sentence similarity loss (the original gives this formula only as an image; by analogy with the segment loss above, it is a triplet softmax loss over the sentence vectors si = f(qi)):
sim_loss = -log( exp(s1 · s2) / (exp(s1 · s2) + exp(s1 · s3)) )
Model training process: data from multiple industries share the same model. Each industry has its own independent domain context vector, and the model additionally includes a global context vector. For each training batch, the industry-specific domain context vector can be trained once, and then the global context vector trained once.
In fig. 1, the sentence encoder is learned from the sentence similarity by means of multi-task learning, and the fc function in the hash function model is learned from the hash similarity.
In addition, as shown in fig. 2, when computing the model loss of the sentence encoder, the sentence encoder can be comprehensively optimized by combining the sentence similarity loss, computed on the sentence vectors produced by the sentence encoder, with the hash similarity loss, computed on the sentence feature vectors obtained from those sentence vectors via the feature encoder fc.
Model prediction process: determine the industry type of the sentence to be processed. For a sentence from an industry already in the model, select that industry's independent domain context vector to perform the attention processing and obtain the sentence vector representation; for a sentence from a new industry, select the global context vector to perform the attention processing and obtain the sentence vector representation.
In this scheme, training the hash function model ensures that, after the sentence vectors are processed by the hash function, the binarized vectors (semantic signatures) corresponding to semantically similar sentence vectors are similar segment by segment, so the similar sentences determined for each sentence through the inverted indexes of the semantic signatures have a good similarity effect, without any extra interference in the process of generating the sentence vectors.
Further, to better guarantee the similarity of the finally obtained similar sentences, the scheme also proposes obtaining the sentence encoder that generates the sentence vectors by learning, so that the sentence vectors generated from semantically similar sentences are relatively similar. In constructing the sentence encoder model, a global context vector and per-industry domain context vectors are introduced to build the attention model. Based on these context vectors, the attention model can vectorize texts from different domains.
The technical solution of the present application is further illustrated by the following examples.
Example one
Fig. 3 is a flowchart of a text similarity recognition method according to an embodiment of the present invention, which includes the following steps:
s310, sentence vectors corresponding to sentences in the given sentence set are obtained.
Acquire the sentence set to be processed; remove meaningless symbols and punctuation from it, normalize numbers and URL links, and deduplicate; then segment each sentence into words and pass the segmented words through a sentence encoder to form the sentence vector of each sentence.
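A minimal sketch of this cleanup step, assuming simple regex normalization of numbers and URLs (the placeholder tokens <URL> and <NUM> are illustrative):

```python
import re

def normalize(sentence: str) -> str:
    """Strip meaningless symbols and unify numbers/URLs so that
    near-duplicate sentences collapse to the same form."""
    s = re.sub(r"https?://\S+", "<URL>", sentence)  # normalize URL links
    s = re.sub(r"\d+(\.\d+)?", "<NUM>", s)          # normalize numbers
    s = re.sub(r"[^\w<>\s]", " ", s)                # drop stray symbols
    return re.sub(r"\s+", " ", s).strip()

def preprocess(sentences: list[str]) -> list[str]:
    """Normalize every sentence, then deduplicate preserving order."""
    seen, out = set(), []
    for s in map(normalize, sentences):
        if s and s not in seen:
            seen.add(s)
            out.append(s)
    return out
```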
In principle, the encoding scheme of the sentence encoder is not limited; however, to further ensure the similarity of the finally determined similar sentences, this scheme preferably computes the sentence vectors with the encoder model structure shown in fig. 1, specifically through the processing shown in fig. 4.
S410, each sentence in the given sentence set is processed through LSTM, and a word vector matrix corresponding to the sentence is obtained.
For example, a vector for each word in a sentence may be obtained first (the word vectors may be pre-trained with Word2Vec or FastText, etc.); the word vectors are then passed through the LSTM network to obtain the word vector matrix of each sentence.
However, new-industry data may contain a large number of industry-specific terms that are likely absent from the preset vocabulary, producing many out-of-vocabulary (OOV) cases and thereby degrading the sentence vector representation. For this reason, the scheme can use character-level input, which effectively mitigates the OOV problem and the word-frequency imbalance problem.
S420, processing the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence; the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
For example, a context vector may be determined from the global context vector or the domain context vector according to the industry domain to which the sentence to be processed belongs, and a model algorithm in the attention model may be formed based on the context vector.
Each industry's (domain's) independent domain context vector comprises attention weight setting information for each target word in sentences of the corresponding domain; the global context vector comprises attention weight setting information for each target word in sentences of unknown domains.
In actual application scenarios, the data volumes of different industries vary greatly, and some industries have only thousands of labeled examples, so training a separate model per industry could leave the low-data industries with poorly performing models. In this scheme, one model learns data from multiple industries; by setting a domain context vector per industry plus a global context vector, the attention mechanism allows information from other industries to be transferred to the target industry.
Accordingly, the process of forming the model algorithm may specifically be: first, judge whether the attention model contains the domain context vector corresponding to the domain to which the current sentence to be processed belongs; if so, select the corresponding domain context vector to form the model algorithm; if not, select the global context vector to form the model algorithm.
Then, the word vector matrix is processed by the context-attention mechanism using the model algorithm formed from the independent domain context vector of the industry (domain) to which the sentence belongs, or from the global context vector, to obtain the final vector representation, i.e., the sentence vector.
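This selection logic reduces to a dictionary lookup with a global fallback; a sketch under assumed names:

```python
import torch

def select_context(domain: str,
                   domain_contexts: dict[str, torch.Tensor],
                   global_context: torch.Tensor) -> torch.Tensor:
    """Use the industry-specific context vector when the attention model
    has one for this domain; otherwise fall back to the global vector."""
    return domain_contexts.get(domain, global_context)

# usage with the SentenceEncoder sketched earlier (names assumed):
# u = select_context("insurance", domain_contexts, global_context)
# sentence_vec = encoder(token_ids, u)
```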
S320, encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector; the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar.
This scheme does not restrict the model structure of the feature encoder, which can be represented by a function fc; for example, it can be a hash encoder based on a hashing algorithm, which can also reduce the dimensionality of the encoded vector to better suit subsequent vector computations. After fc encoding, the sentence vectors form sentence feature vectors of a specified dimension that satisfy the condition: sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. In this way, segment-wise similarity guarantees overall similarity, so that in step S330, when the semantic signatures are generated from the dimension values of the sentence feature vectors, the semantic signatures of semantically similar sentence feature vectors are also similar on the corresponding vector segments.
S330, generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing the inverted indexes from the semantic signatures to the corresponding sentences.
For the sentence feature vector of each sentence in the given sentence set, the vector can be converted into a fixed-dimension 0/1 digital semantic signature according to the sign of each dimension value, e.g., a dimension value greater than or equal to 0 is converted into 1 and a dimension value less than 0 into 0. For example, the sentence feature vector (3, 5, -4, -2, 1, 6) yields the semantic signature (1, 1, 0, 0, 1, 1). An inverted index from the semantic signatures to the corresponding sentences is then constructed.
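As a sketch, the binarization reproducing the worked example above:

```python
import numpy as np

def semantic_signature(feature_vec: np.ndarray) -> tuple[int, ...]:
    """Map each dimension to 1 if >= 0, else to 0."""
    return tuple(int(v >= 0) for v in feature_vec)

# the worked example from the text:
assert semantic_signature(np.array([3, 5, -4, -2, 1, 6])) == (1, 1, 0, 0, 1, 1)
```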
When constructing the semantic signature index, the semantic signature can be segmented following the segmentation standard of the sentence feature vector, and an inverted index built from each semantic signature segment to the corresponding sentences. This enlarges the sentence set indexed by the same semantic signature segment and improves the chance of finding similar sentences in the subsequent search.
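A sketch of this segmented inverted index, keyed by (segment position, segment value); the mapping from sentence to signature is assumed computed beforehand:

```python
from collections import defaultdict

def build_inverted_index(signatures: dict[str, tuple[int, ...]], B: int):
    """Index sentences by each of their B signature segments, so that
    sentences sharing any segment fall into the same bucket."""
    index = defaultdict(set)
    for sentence, sig in signatures.items():
        seg_len = len(sig) // B
        for j in range(B):
            segment = sig[j * seg_len:(j + 1) * seg_len]
            index[(j, segment)].add(sentence)
    return index
```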
S340, determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
For example, the sentences indexed by similar semantic signatures may be grouped into similar-sentence groups according to the similarity between the semantic signatures, and within each group the similar sentences corresponding to each sentence are determined.
As another example, based on the segmentation of the semantic signatures in step S330, the sentences indexed by identical signature segments may form a similar-sentence group for the corresponding segment, and within each group the similar sentences corresponding to each sentence are determined. That is, given a semantic signature segment and the sentence set it indexes, for each sentence in that set, a corresponding similar sentence is selected from the remaining sentences in the set.
Further, when the similar sentences of each sentence are selected within the sentence set indexed by a semantic signature, the cosine similarity between the sentence vector of that sentence and each remaining sentence is calculated, and the corresponding similar sentences are selected from the remaining sentences according to the cosine values; for example, the Top-K remaining sentences with the largest cosine similarity (i.e., the smallest cosine distance) are selected as the similar sentences of the current sentence.
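A sketch of this Top-K selection within one index bucket, assuming the sentence vectors are available as numpy arrays:

```python
import numpy as np

def top_k_similar(query: str, bucket: list[str],
                  sent_vecs: dict[str, np.ndarray], k: int = 5):
    """Rank the other sentences of a bucket by cosine similarity of
    their sentence vectors and keep the Top-K most similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    q = sent_vecs[query]
    scored = [(s, cos(q, sent_vecs[s])) for s in bucket if s != query]
    scored.sort(key=lambda t: t[1], reverse=True)  # most similar first
    return scored[:k]
```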
With the text similarity recognition method provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. Because the semantic signatures corresponding to semantically similar sentences are themselves similar, determining the similar sentences corresponding to each sentence in the sentence set according to the inverted indexes of the semantic signatures effectively guarantees the similarity between the determined similar sentences.
Further, when obtaining the sentence vectors corresponding to the sentences in the given sentence set, the scheme adopts a sentence vectorization model combining an LSTM with an attention model, and introduces a global context vector and domain context vectors into the attention model so that sentence vectors generated from semantically similar sentences are as similar as possible; this helps keep semantic similarity consistent with sentence-feature-vector similarity when the sentence feature vectors are formed later.
Example two
Fig. 5 is a structural diagram of a text similarity recognition apparatus according to an embodiment of the present invention, which can control execution of the method steps shown in fig. 3; the apparatus includes:
a sentence vector obtaining module 510, configured to obtain a sentence vector corresponding to each sentence in a given sentence set;
a sentence vector encoding module 520, configured to encode each sentence vector with a feature encoder obtained through pre-learning and generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module 530, configured to generate a semantic signature of a corresponding sentence based on positive and negative features of each dimension value in the sentence feature vector, and construct an inverted index from the semantic signature to the corresponding sentence;
and a similar sentence determining module 540, configured to determine a similar sentence corresponding to each sentence in the sentence set according to the inverted index of each semantic signature.
Further, the signature index constructing module 530 may be configured to segment the semantic signatures according to the segmentation standard of the sentence feature vector, and construct an inverted index from each semantic signature segment to the corresponding sentences.
Further, the similar sentence determining module 540 may be configured to select, for each sentence in the sentence set, a corresponding similar sentence from the remaining sentences in the sentence set according to the given semantic signature segment and the sentence set indexed by the semantic signature segment.
Further, the similar sentence determining module 540 may be configured to calculate, for each sentence in the sentence set, the cosine similarity between the sentence vector of that sentence and each remaining sentence, and select the corresponding similar sentences from the remaining sentences according to the cosine values.
Further, as shown in fig. 6, in the text similarity recognition apparatus shown in fig. 5, the sentence vector obtaining module 510 may include:
a word vector obtaining unit 610, configured to perform LSTM processing on each sentence in the given sentence set to obtain a word vector matrix corresponding to the sentence;
a sentence vector obtaining unit 620, configured to process the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
Further, the sentence vector obtaining module 510 may further include:
a model algorithm building unit 630, configured to determine whether the attention model includes a domain context vector corresponding to a domain to which a current sentence to be processed belongs;
if yes, selecting a corresponding domain context vector to form a model algorithm;
and if not, selecting a global context vector to form a model algorithm.
Further, the feature encoder may be a hash encoder.
With the text similarity recognition apparatus provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. Because the semantic signatures corresponding to semantically similar sentences are themselves similar, determining the similar sentences corresponding to each sentence in the sentence set according to the inverted indexes of the semantic signatures effectively guarantees the similarity between the determined similar sentences.
Further, when obtaining the sentence vectors corresponding to the sentences in the given sentence set, the scheme adopts a sentence vectorization model combining an LSTM with an attention model, and introduces a global context vector and domain context vectors into the attention model so that sentence vectors generated from semantically similar sentences are as similar as possible; this helps keep semantic similarity consistent with sentence-feature-vector similarity when the sentence feature vectors are formed later.
EXAMPLE III
The foregoing embodiment describes an overall architecture of a text similarity recognition apparatus, where the functions of the apparatus can be implemented by an electronic device, as shown in fig. 7, which is a schematic structural diagram of the electronic device according to an embodiment of the present invention, and specifically includes: a memory 710 and a processor 720.
And a memory 710 for storing programs.
In addition to the programs described above, the memory 710 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 710 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 720, coupled to the memory 710, for executing the program in the memory 710, the program executing the steps of the text similarity recognition method described in the previous embodiment.
Further, the processor 720 may also include various modules described in the foregoing embodiments to perform the operation of text similarity recognition, and the memory 710 may be used, for example, to store data required by the modules to perform the operation and/or output data.
The above specific processing operations have been described in detail in the foregoing embodiments, and are not described again here.
Further, as shown in fig. 7, the electronic device may further include: communication component 730, power component 740, audio component 750, display 760, and other components. Only some of the components are schematically shown in fig. 7, and the electronic device is not meant to include only the components shown in fig. 7.
The communication component 730 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 730 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 730 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply assembly 740 that provides power to the various components of the electronic device. The power components 740 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 750 is configured to output and/or input audio signals. For example, the audio component 750 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 710 or transmitted via the communication component 730. In some embodiments, audio assembly 750 also includes a speaker for outputting audio signals.
Display 760 comprises a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Example four
Fig. 8 is a flowchart of a text vectorization processing method according to an embodiment of the present invention, which includes the following steps:
s810, performing word vector conversion on the sentence to be processed to obtain a word vector matrix corresponding to the sentence.
When a sentence is subjected to word vector conversion, the word vector matrix can be obtained through LSTM processing. In practical application scenarios, the word vector conversion can also be performed with other models trained, for example, by a deep learning algorithm.
S820, processing the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
For example, a context vector may be determined from the global context vector or the domain context vector according to the industry domain to which the sentence to be processed belongs, and a model algorithm in the attention model may be formed based on the context vector.
Each industry's (domain's) independent domain context vector comprises attention weight setting information for each target word in sentences of the corresponding domain; the global context vector comprises attention weight setting information for each target word in sentences of unknown domains.
In actual application scenarios, the data volumes of different industries vary greatly, and some industries have only thousands of labeled examples, so training a separate model per industry could leave the low-data industries with poorly performing models. In this scheme, one model learns data from multiple industries; by setting a domain context vector per industry plus a global context vector, the attention mechanism allows information from other industries to be transferred to the target industry.
Accordingly, the process of forming the model algorithm may specifically be: first, judge whether the attention model contains the domain context vector corresponding to the domain to which the current sentence to be processed belongs; if so, select the corresponding domain context vector to form the model algorithm; if not, select the global context vector to form the model algorithm.
Then, the word vector matrix is processed by the context-attention mechanism using the model algorithm formed from the independent domain context vector of the industry (domain) to which the sentence belongs, or from the global context vector, to obtain the final vector representation, i.e., the sentence vector.
According to the text vectorization processing method provided by the invention, introducing an attention mechanism over per-industry domain contexts and a global context allows sentences from different domains to be vectorized by the same sentence vectorization model, without training a separate model per domain; this effectively reduces the complexity of model training in sentence vectorization while improving the practicality and efficiency of text vectorization.
EXAMPLE five
Fig. 9 is a schematic structural diagram of a text vectorization processing apparatus according to an embodiment of the present invention, which can be used to execute the method steps shown in fig. 8, and includes:
a word vector obtaining module 910, configured to perform word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector obtaining module 920, configured to process the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
Further, as shown in fig. 10, the text vectorization processing apparatus shown in fig. 9 may further include:
the model algorithm building module 101 is used for judging whether the attention model contains a domain context vector corresponding to a domain to which a sentence to be processed belongs;
if yes, selecting a corresponding domain context vector to form a model algorithm;
and if not, selecting a global context vector to form the model algorithm.
The text vectorization processing apparatus provided by the invention, by introducing an attention mechanism over per-industry domain contexts and a global context, can vectorize sentences from different domains with the same sentence vectorization model without training a separate model per domain, which effectively reduces the complexity of model training in sentence vectorization while improving the practicality and efficiency of text vectorization.
EXAMPLE six
The foregoing embodiment describes an overall architecture of a text vectorization processing apparatus, where functions of the apparatus can be implemented by an electronic device, as shown in fig. 11, which is a schematic structural diagram of the electronic device according to an embodiment of the present invention, and specifically includes: a memory 111 and a processor 112.
The memory 111 stores programs.
In addition to the above-described programs, the memory 111 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 111 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 112, coupled to the memory 111, for executing the program in the memory 111, the program executing the operation steps of the text vectorization processing method described in the foregoing embodiment.
Further, the processor 112 may also include various modules described in the foregoing embodiments to perform operations of the text vectorization processing, and the memory 111 may be used, for example, to store data required for the modules to perform the operations and/or output data.
The above specific processing operations have been described in detail in the foregoing embodiments, and are not described again here.
Further, as shown in fig. 11, the electronic device may further include: communication components 113, power components 114, audio components 115, display 116, and other components. Only some of the components are schematically shown in fig. 11, and it is not meant that the electronic device includes only the components shown in fig. 11.
The communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 113 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 113 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 114 that provides power to the various components of the electronic device. The power components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for an electronic device.
Audio component 115 is configured to output and/or input audio signals. For example, audio component 115 may include a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 111 or transmitted via the communication component 113. In some embodiments, audio component 115 also includes a speaker for outputting audio signals.
The display 116 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A text similarity recognition method comprises the following steps:
obtaining sentence vectors corresponding to sentences in a given sentence set;
encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
2. The method of claim 1, wherein the generating semantic signatures of the corresponding sentences based on the sign of each dimension value in the sentence feature vector and constructing an inverted index from the semantic signatures to the corresponding sentences comprises:
segmenting the semantic signatures according to the segmentation standard of the sentence feature vector, and constructing an inverted index from each semantic signature segment to the corresponding sentences.
3. The method of claim 2, wherein the determining similar sentences corresponding to each sentence in the set of sentences according to the inverted index of each semantic signature comprises:
according to the given semantic signature fragment and the sentence set indexed by the semantic signature fragment, aiming at each sentence in the sentence set, selecting a corresponding similar sentence from the rest sentences in the sentence set.
4. The method of claim 3, wherein for each sentence in the set of sentences, selecting a corresponding similar sentence from the remaining sentences in the set of sentences comprises:
calculating, for each sentence in the sentence set, the cosine similarity between the sentence vector of that sentence and each remaining sentence, and selecting the corresponding similar sentences from the remaining sentences according to the cosine values.
5. The method of claim 1, wherein the obtaining a sentence vector corresponding to each sentence in the given set of sentences comprises:
processing each sentence in the given sentence set through LSTM to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
6. The method of claim 5, wherein forming the model algorithm comprises:
judging whether the attention model contains a domain context vector corresponding to the domain to which the current sentence to be processed belongs;
if yes, selecting a corresponding field context vector to form the model algorithm;
and if not, selecting the global context vector to form the model algorithm.
7. The method of claim 1, wherein the feature encoder is a hash encoder.
8. A text similarity recognition apparatus comprising:
a sentence vector acquisition module for acquiring sentence vectors corresponding to each sentence in a given sentence set;
a sentence vector encoding module for encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module for generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and the similar sentence determining module is used for determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
9. A text vectorization processing method, comprising:
performing word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector for at least one domain, and the model algorithm is formed based on either the global context vector or one of the domain context vectors.
10. The method of claim 9, wherein forming the model algorithm comprises:
judging whether the attention model contains a domain context vector corresponding to the domain to which the sentence to be processed belongs;
if so, selecting the corresponding domain context vector to form the model algorithm;
and if not, selecting the global context vector to form the model algorithm.
11. A text vectorization processing apparatus, comprising:
a word vector acquisition module, configured to perform word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector acquisition module, configured to process the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector for at least one domain, and the model algorithm is formed based on either the global context vector or one of the domain context vectors.
12. An electronic device, comprising:
a memory for storing a program;
a processor, coupled to the memory, for executing the program, wherein the program, when run, performs the text similarity recognition method of any one of claims 1 to 7.
13. An electronic device, comprising:
a memory for storing a program;
a processor, coupled to the memory, for executing the program, wherein the program, when run, performs the text vectorization processing method of claim 9 or claim 10.
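
The signature-and-index scheme of claims 1-3 can be pictured with a minimal Python sketch. The one-bit-per-dimension signature follows the claims (a bit is set when the dimension value is positive); the equal-width four-segment split and the names used here are illustrative assumptions, since the claims leave the segmentation scheme open.

```python
# Minimal sketch of claims 1-3: a binary semantic signature taken from the
# signs of a sentence feature vector, split into segments, with an inverted
# index from each signature segment to the sentences that share it.
from collections import defaultdict

import numpy as np

NUM_SEGMENTS = 4  # assumed segmentation scheme; the claims leave this open


def semantic_signature(feature_vec: np.ndarray) -> str:
    # One bit per dimension: "1" if the value is positive, "0" otherwise.
    return "".join("1" if v > 0 else "0" for v in feature_vec)


def signature_segments(signature: str, num_segments: int = NUM_SEGMENTS):
    # Split the signature into equal-width segments, mirroring the assumed
    # segmentation of the sentence feature vector.
    width = len(signature) // num_segments
    return [signature[i * width:(i + 1) * width] for i in range(num_segments)]


def build_inverted_index(feature_vectors: np.ndarray):
    # Map (segment position, segment bits) -> ids of sentences sharing them.
    index = defaultdict(set)
    for sent_id, vec in enumerate(feature_vectors):
        for pos, seg in enumerate(signature_segments(semantic_signature(vec))):
            index[(pos, seg)].add(sent_id)
    return index


def candidate_similar(sent_id: int, feature_vectors: np.ndarray, index) -> set:
    # Sentences colliding with sent_id on at least one signature segment are
    # the candidate similar sentences retrieved via the inverted index.
    candidates = set()
    sig = semantic_signature(feature_vectors[sent_id])
    for pos, seg in enumerate(signature_segments(sig)):
        candidates |= index[(pos, seg)]
    candidates.discard(sent_id)
    return candidates
```

Because the feature encoder is trained so that similar sentences agree segment by segment, two similar sentences should collide on at least one signature segment, letting the index retrieve candidates without comparing every sentence pair.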
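
Claim 4 then refines those candidates by cosine similarity over the original sentence vectors. A sketch under the assumption of a top-k selection policy follows; the claim itself only says that similar sentences are selected according to the cosine value.

```python
# Minimal sketch of claim 4: rank candidate sentences by the cosine
# similarity of the original sentence vectors and keep the best matches.
import numpy as np


def select_similar(sent_id, sentence_vectors, candidates, top_k=5):
    # top_k is an assumed cutoff; a threshold on the cosine value would
    # also fit the claim's wording.
    query = sentence_vectors[sent_id]
    query = query / np.linalg.norm(query)
    scored = []
    for cand_id in candidates:
        vec = sentence_vectors[cand_id]
        cos = float(query @ (vec / np.linalg.norm(vec)))
        scored.append((cos, cand_id))
    scored.sort(reverse=True)
    return [cand_id for _, cand_id in scored[:top_k]]
```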
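
Claims 5-6 (and, symmetrically, claims 9-10) pool a word vector matrix into a sentence vector with an attention model that holds one global context vector plus optional per-domain context vectors. The softmax attention form below, and the random initialization standing in for learned parameters, are assumptions of this sketch; the claims only state that the model algorithm is formed from whichever context vector is selected.

```python
# Minimal sketch of claims 5-6 / 9-10: attention pooling with a fallback
# from a domain context vector to the global context vector.
import numpy as np


class AttentionPooler:
    def __init__(self, dim: int, domains=()):
        rng = np.random.default_rng(0)
        # Stand-ins for parameters that would be learned in practice.
        self.global_ctx = rng.normal(size=dim)
        self.domain_ctx = {d: rng.normal(size=dim) for d in domains}

    def _context(self, domain):
        # Use the domain context vector when the model contains one for the
        # sentence's domain; otherwise fall back to the global vector.
        return self.domain_ctx.get(domain, self.global_ctx)

    def sentence_vector(self, word_matrix: np.ndarray, domain=None):
        ctx = self._context(domain)
        scores = word_matrix @ ctx                 # one score per word
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()          # softmax attention weights
        return weights @ word_matrix               # weighted sum of word rows
```

In a trained system the context vectors, like the LSTM that produces the word vector matrix, would be learned; here the word vector matrix is simply taken as input.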
CN201910425307.9A 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment Pending CN112069790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910425307.9A CN112069790A (en) 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112069790A (en) 2020-12-11

Family

ID=73657843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910425307.9A Pending CN112069790A (en) 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112069790A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170322654A1 (en) * 2012-08-24 2017-11-09 Ricky Steven Pionkowski Touchphrase interface environment
CN106897263A (en) * 2016-12-29 2017-06-27 北京光年无限科技有限公司 Robot dialogue exchange method and device based on deep learning
US20190057075A1 (en) * 2017-08-17 2019-02-21 International Business Machines Corporation Domain-specific lexically-driven pre-parser
CN108717407A (en) * 2018-05-11 2018-10-30 北京三快在线科技有限公司 Entity vector determines method and device, information retrieval method and device
CN108804417A (en) * 2018-05-21 2018-11-13 山东科技大学 A kind of documentation level sentiment analysis method based on specific area emotion word

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"聚焦提升企业竞争力的途径", 汽车纵横, no. 12 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065011A (en) * 2021-03-17 2021-07-02 北京沃东天骏信息技术有限公司 Picture determination method and device
CN113065011B (en) * 2021-03-17 2024-01-16 北京沃东天骏信息技术有限公司 Picture determination method and device

Similar Documents

Publication Publication Date Title
CN109299273B (en) Multi-source multi-label text classification method and system based on improved seq2seq model
CN111291566B (en) Event main body recognition method, device and storage medium
JP2022172381A (en) Text extraction method, text extraction model training method, device and equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112825114A (en) Semantic recognition method and device, electronic equipment and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN113051384B (en) User portrait extraction method based on dialogue and related device
CN112069790A (en) Text similarity recognition method and device and electronic equipment
CN112270184A (en) Natural language processing method, device and storage medium
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
Qi et al. Video captioning via a symmetric bidirectional decoder
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN115408494A (en) Text matching method integrating multi-head attention alignment
CN113255829A (en) Zero sample image target detection method and device based on deep learning
CN114492410A (en) Contract information extraction method and device
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN112559750A (en) Text data classification method and device, nonvolatile storage medium and processor
CN114722817A (en) Event processing method and device
CN113792120A (en) Graph network construction method and device and reading understanding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination