CN112069790A - Text similarity recognition method and device and electronic equipment - Google Patents

Text similarity recognition method and device and electronic equipment Download PDF

Info

Publication number
CN112069790A
Authority
CN
China
Prior art keywords
sentence
vector
sentences
similar
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910425307.9A
Other languages
Chinese (zh)
Inventor
陈克寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910425307.9A
Publication of CN112069790A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

Embodiments of the invention provide a text similarity recognition method and apparatus and an electronic device. The method comprises the following steps: obtaining sentence vectors corresponding to the sentences in a given sentence set; encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar; generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences; and determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures. The scheme of the embodiments can effectively improve the similarity between the identified similar sentences.

Description

Text similarity recognition method and device and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular, to a text similarity recognition method and apparatus, and an electronic device.
Background
Effective machine learning algorithms mostly rely on large amounts of high-quality labeled training data. Obtaining such high-quality label data has always been a high-labor-cost problem in machine learning. For massive text data, having business personnel label every item directly is extremely time- and labor-consuming.
In the prior art, similar text data are generally labeled in batches by analyzing the similarity between texts, thereby improving the efficiency of obtaining label data. The following method is mainly adopted for analyzing the similarity of massive texts:
first, for the sentence vector of each sentence in a given sentence set, convert the sentence vector into a fixed-dimension 0/1 digital semantic signature according to the sign (positive or negative) of each dimension value; then, based on an index over the semantic signatures, preliminarily group the sentences indexed by the same semantic signature (or semantic signature fragment) into candidate similar-sentence groups, and determine the similar sentences corresponding to each sentence within each candidate group.
In this process, the semantic signatures decompose the search for similar sentences over the whole sentence set into searches within each candidate similar-sentence group, which reduces the complexity of the similarity computation. However, this method cannot ensure that the semantic signatures obtained from actually similar sentences are themselves similar, which degrades the similarity of the finally determined similar sentences.
Disclosure of Invention
The invention provides a text similarity recognition method and apparatus and an electronic device, which can effectively improve the similarity between the recognized similar sentences.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, a text similarity recognition method is provided, including:
obtaining sentence vectors corresponding to sentences in a given sentence set;
encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
In a second aspect, an apparatus for recognizing text similarity is provided, including:
a sentence vector acquisition module for acquiring the sentence vectors corresponding to the sentences in a given sentence set;
a sentence vector encoding module for encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module for generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and a similar sentence determining module for determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
In a third aspect, a text vectorization processing method is provided, including:
carrying out word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
In a fourth aspect, there is provided a text vectorization processing apparatus including:
the word vector acquisition module is used for carrying out word vector conversion on the sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector acquisition module, configured to process the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
In a fifth aspect, an electronic device is provided, comprising:
a memory for storing a program;
and a processor, coupled to the memory, for executing the program, wherein the program, when running, performs the text similarity recognition method provided by the invention.
In a sixth aspect, an electronic device is provided, comprising:
a memory for storing a program;
and a processor, coupled to the memory, for executing the program, wherein the program, when running, performs the text vectorization processing method provided by the invention.
With the text similarity recognition method, apparatus and electronic device provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. When the inverted indexes from the semantic signatures to the corresponding sentences are constructed, the semantic signatures corresponding to semantically similar sentences are themselves similar, so the similarity between the determined similar sentences is effectively guaranteed when the similar sentences corresponding to the sentences in the sentence set are determined according to the inverted indexes of the semantic signatures.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that it can be implemented according to the content of this description, and to make the above and other objects, features, and advantages of the present application more readily understandable, a detailed description of the present application follows.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of a training model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a loss function model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a text similarity recognition method according to an embodiment of the present invention;
FIG. 4 is a flow diagram of a method of generating a sentence vector;
FIG. 5 is a first block diagram of a text similarity recognition apparatus according to an embodiment of the present invention;
FIG. 6 is a second block diagram of a text similarity recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a first schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 8 is a flowchart of a text vectorization processing method according to an embodiment of the present invention;
FIG. 9 is a first block diagram of a text vectorization processing apparatus according to an embodiment of the present invention;
FIG. 10 is a block diagram of a second exemplary embodiment of a text vectorization processing apparatus;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the invention address the following defect in the prior art: when sentence similarity is determined through semantic signatures, the existing hash functions cannot guarantee that the semantic signatures generated from semantically similar sentence vectors are themselves similar, so the similarity of the sentences deemed similar after inverted indexing over the semantic signatures is hard to guarantee. The core idea is that, in the process of forming semantic signatures from sentence vectors, the sentence vectors are secondarily encoded by a feature encoder to generate corresponding sentence feature vectors. The feature encoder is obtained by learning, so that sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding feature vector segment, while those generated from dissimilar sentence vectors are not. Thus, when the semantic signatures of the corresponding sentences are generated based on the signs of the dimension values in the sentence feature vectors, the semantic signatures corresponding to semantically similar sentence feature vectors are also similar on the corresponding vector segments. By constructing inverted indexes from the semantic signatures to the corresponding sentences, the sentences indexed by similar semantic signatures are themselves relatively similar, so the similarity between the determined similar sentences is better guaranteed when they are found through the inverted indexes of the semantic signatures.
When training the feature encoder (hereinafter denoted fc), it can be fused with the existing semantic signature algorithm into an integral hash function model for training. fc is trained so that sentence vectors, once encoded by fc into sentence feature vectors and passed through the existing hash function, yield semantic signatures with the following property: semantically similar sentence vectors produce similar semantic signatures.
The hash function model has the following structure:
hash(x) = sgn(fc(x))
where fc is a fully connected layer (the original gives its formula as an image; a standard fully connected layer computes fc(x) = W·x + b).
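For illustration, a minimal PyTorch sketch of this hash function model follows; the class name FeatureEncoder and the layer sizes are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """fc: a single fully connected layer mapping a sentence vector
    (dimension d_in) to a D-dimensional sentence feature vector."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

def semantic_hash(encoder: FeatureEncoder, sentence_vec: torch.Tensor) -> torch.Tensor:
    """hash(x) = sgn(fc(x)): binarize every dimension of the feature vector."""
    feat = encoder(sentence_vec)
    return (feat >= 0).int()  # 1 for non-negative dimensions, 0 otherwise
```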
Loss function: construct sentence triples (q1, q2, q3), where q1 and q2 are semantically similar and q1 and q3 are not. After the sentence triples are passed through a sentence encoder to obtain sentence vectors, the corresponding sentence feature vectors (x1, x2, x3) are computed by fc. Each xi is a D-dimensional vector, divided into B segments of D/B dimensions each, denoted xij (i = 1, 2, 3; j = 1, 2, …, B). The loss function is computed as follows:
Segment loss:
segment_loss(j) = -log( exp(x1j · x2j) / (exp(x1j · x2j) + exp(x1j · x3j)) )
Hash loss (the original gives this formula only as an image; read together with the segment loss above, it aggregates the per-segment losses):
hash_loss = Σ_(j=1..B) segment_loss(j)
The hash function model encodes the following two key pieces of information in its loss function:
after binarization by the hash function (i.e., once the 0/1 digital semantic signatures are formed), the semantic signature segments corresponding to similar sentences are similar, and those corresponding to dissimilar sentences are dissimilar;
after binarization by the hash function, the original distance information of the sentence vectors is preserved (i.e., two close values should not be binarized differently, one to 1 and the other to 0).
Referring to the loss function of the hash function model: the segment classification guarantees the first point, that the semantic signature segments of similar sentences are similar and those of dissimilar sentences are dissimilar; and because the segment classification uses dot products (·), the second point is also guaranteed, namely that the original distance information of the sentence vectors is still preserved after binarization by the hash function.
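The loss computation can be sketched as follows (a non-authoritative reading: the aggregate hash loss is assumed to sum the per-segment losses, and D is assumed divisible by B):

```python
import torch

def segment_losses(x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor,
                   B: int) -> torch.Tensor:
    """Per-segment triplet loss for feature vectors (x1, x2, x3) of a
    triple (q1, q2, q3): q1/q2 similar, q1/q3 dissimilar.
    Each D-dimensional vector is split into B segments of D/B dims."""
    D = x1.shape[0]
    seg = lambda x: x.view(B, D // B)         # (B, D/B), assumes B divides D
    s1, s2, s3 = seg(x1), seg(x2), seg(x3)
    pos = (s1 * s2).sum(dim=1)                # dot product per segment
    neg = (s1 * s3).sum(dim=1)
    # segment_loss(j) = -log(exp(pos_j) / (exp(pos_j) + exp(neg_j)));
    # a production version would use logsumexp for numerical stability
    return -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(neg)))

def hash_loss(x1, x2, x3, B: int) -> torch.Tensor:
    # assumption: the aggregate loss sums the B segment losses
    return segment_losses(x1, x2, x3, B).sum()
```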
In addition, the sentence encoder used to obtain the sentence vectors can itself be obtained by machine learning, so that semantically similar sentences yield relatively similar sentence vectors.
Fig. 1 shows the joint training model of the sentence encoder and the hash function model.
The learning process of the sentence encoder is as follows:
As shown in fig. 1, the data to be trained are massive text data from various industries; each industry has a plurality of application scenarios, and each scenario corresponds to a sentence set. The learning process of the sentence encoder uses labeled text data to learn vector representations of sentences, so that the sentence vectors corresponding to semantically similar sentences are also similar.
Data preprocessing: group by industry and convert each industry's text data into a set of sentence triples, e.g. (q1, q2, q3), where q1, q2 and q3 are sentences of the same industry, q1 and q2 are sentences of the same scenario, and q1 and q3 are sentences of different scenarios.
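A minimal sketch of this triple construction; the data layout (a dict mapping industry → scenario → sentences) and the sampling counts are illustrative assumptions:

```python
import random

def build_triples(industry_data: dict[str, dict[str, list[str]]],
                  n_per_industry: int = 1000):
    """Yield (q1, q2, q3): q1 and q2 from the same scenario of one
    industry, q3 from a different scenario of that industry."""
    for industry, scenes in industry_data.items():
        scene_ids = list(scenes)
        if len(scene_ids) < 2:
            continue  # need at least two scenarios to draw a negative
        for _ in range(n_per_industry):
            s_pos, s_neg = random.sample(scene_ids, 2)
            if len(scenes[s_pos]) < 2:
                continue
            q1, q2 = random.sample(scenes[s_pos], 2)
            q3 = random.choice(scenes[s_neg])
            yield q1, q2, q3
```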
Model structure: the model mainly comprises two parts: sentence vector extraction and sentence similarity loss function. Inputting sentences (each sentence comprises a plurality of words wi, i-1,2 …, n), and obtaining a word vector matrix through a two-way LSTM (Long Short-Term Memory network); and then an attention layer (algorithm) formed by the context vectors (ui, i ═ 1,2, …, n) independent from the corresponding industries (fields) is used for carrying out context-attention mechanism processing to obtain final vector representation, namely sentence vectors. The model that converts sentences into sentence vectors is called a sentence coder, and the function is denoted as f (x). The sentence similarity loss function is calculated in a mode of classifying according to industries.
Sentence similarity loss (the original gives this formula only as an image; by analogy with the segment loss above, it is a triplet softmax loss over the sentence vectors si = f(qi)):
sim_loss = -log( exp(s1 · s2) / (exp(s1 · s2) + exp(s1 · s3)) )
Model training process: data from multiple industries share the same model. Each industry has its own independent domain context vector, and the model additionally includes a global context vector. For each training batch, the industry-specific domain context vector can be trained once, and then the global context vector trained once.
In fig. 1, the sentence encoder is learned from the sentence similarity by means of multi-task learning, and the fc function in the hash function model is learned from the hash similarity.
In addition, as shown in fig. 2, when computing the model loss of the sentence encoder, the sentence encoder can be comprehensively optimized by combining the sentence similarity loss, computed on the sentence vectors produced by the sentence encoder, with the hash similarity loss, computed on the sentence feature vectors obtained from those sentence vectors via the feature encoder fc.
Model prediction process: determine the industry type of the sentence to be processed. For a sentence from an industry already in the model, select that industry's independent domain context vector to perform the attention processing and obtain the sentence vector representation; for a sentence from a new industry, select the global context vector to perform the attention processing and obtain the sentence vector representation.
In this scheme, training the hash function model ensures that, after the sentence vectors are processed by the hash function, the binarized vectors (semantic signatures) corresponding to semantically similar sentence vectors are similar segment by segment, so the similar sentences determined for each sentence through the inverted indexes of the semantic signatures have a good similarity effect, without any extra interference in the process of generating the sentence vectors.
Further, to better guarantee the similarity of the finally obtained similar sentences, the scheme also proposes obtaining the sentence encoder that generates the sentence vectors by learning, so that the sentence vectors generated from semantically similar sentences are relatively similar. In constructing the sentence encoder model, a global context vector and per-industry domain context vectors are introduced to build the attention model. Based on these context vectors, the attention model can vectorize texts from different domains.
The technical solution of the present application is further illustrated by the following examples.
Example one
Fig. 3 is a flowchart of a text similarity recognition method according to an embodiment of the present invention, which includes the following steps:
s310, sentence vectors corresponding to sentences in the given sentence set are obtained.
Acquire the sentence set to be processed; remove meaningless symbols and punctuation from it, normalize numbers and URL links, and deduplicate; then segment each sentence into words and pass the segmented words through a sentence encoder to form the sentence vector of each sentence.
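A minimal sketch of this cleanup step, assuming simple regex normalization of numbers and URLs (the placeholder tokens <URL> and <NUM> are illustrative):

```python
import re

def normalize(sentence: str) -> str:
    """Strip meaningless symbols and unify numbers/URLs so that
    near-duplicate sentences collapse to the same form."""
    s = re.sub(r"https?://\S+", "<URL>", sentence)  # normalize URL links
    s = re.sub(r"\d+(\.\d+)?", "<NUM>", s)          # normalize numbers
    s = re.sub(r"[^\w<>\s]", " ", s)                # drop stray symbols
    return re.sub(r"\s+", " ", s).strip()

def preprocess(sentences: list[str]) -> list[str]:
    """Normalize every sentence, then deduplicate preserving order."""
    seen, out = set(), []
    for s in map(normalize, sentences):
        if s and s not in seen:
            seen.add(s)
            out.append(s)
    return out
```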
In principle, the encoding scheme of the sentence encoder is not limited; however, to further ensure the similarity of the finally determined similar sentences, this scheme preferably computes the sentence vectors with the encoder model structure shown in fig. 1, specifically through the processing shown in fig. 4.
S410, each sentence in the given sentence set is processed through LSTM, and a word vector matrix corresponding to the sentence is obtained.
For example, a vector for each word in a sentence may be obtained first (the word vectors may be pre-trained with Word2Vec or FastText, etc.); the word vectors are then passed through the LSTM network to obtain the word vector matrix of each sentence.
However, new-industry data may contain a large number of industry-specific terms that are likely absent from the preset vocabulary, producing many out-of-vocabulary (OOV) cases and thereby degrading the sentence vector representation. For this reason, the scheme can use character-level input, which effectively mitigates the OOV problem and the word-frequency imbalance problem.
S420, processing the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence; the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
For example, a context vector may be determined from the global context vector or the domain context vector according to the industry domain to which the sentence to be processed belongs, and a model algorithm in the attention model may be formed based on the context vector.
Each industry's (domain's) independent domain context vector comprises attention weight setting information for each target word in sentences of the corresponding domain; the global context vector comprises attention weight setting information for each target word in sentences of unknown domains.
In actual application scenarios, the data volumes of different industries vary greatly, and some industries have only thousands of labeled examples, so training a separate model per industry could leave the low-data industries with poorly performing models. In this scheme, one model learns data from multiple industries; by setting a domain context vector per industry plus a global context vector, the attention mechanism allows information from other industries to be transferred to the target industry.
Accordingly, the process of forming the model algorithm may specifically be: first, judge whether the attention model contains the domain context vector corresponding to the domain to which the current sentence to be processed belongs; if so, select the corresponding domain context vector to form the model algorithm; if not, select the global context vector to form the model algorithm.
Then, the word vector matrix is processed by the context-attention mechanism using the model algorithm formed from the independent domain context vector of the industry (domain) to which the sentence belongs, or from the global context vector, to obtain the final vector representation, i.e., the sentence vector.
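This selection logic reduces to a dictionary lookup with a global fallback; a sketch under assumed names:

```python
import torch

def select_context(domain: str,
                   domain_contexts: dict[str, torch.Tensor],
                   global_context: torch.Tensor) -> torch.Tensor:
    """Use the industry-specific context vector when the attention model
    has one for this domain; otherwise fall back to the global vector."""
    return domain_contexts.get(domain, global_context)

# usage with the SentenceEncoder sketched earlier (names assumed):
# u = select_context("insurance", domain_contexts, global_context)
# sentence_vec = encoder(token_ids, u)
```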
S320, encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector; the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar.
This scheme does not restrict the model structure of the feature encoder, which can be represented by a function fc; for example, it can be a hash encoder based on a hashing algorithm, which can also reduce the dimensionality of the encoded vector to better suit subsequent vector computations. After fc encoding, the sentence vectors form sentence feature vectors of a specified dimension that satisfy the condition: sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. In this way, segment-wise similarity guarantees overall similarity, so that in step S330, when the semantic signatures are generated from the dimension values of the sentence feature vectors, the semantic signatures of semantically similar sentence feature vectors are also similar on the corresponding vector segments.
S330, generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing the inverted indexes from the semantic signatures to the corresponding sentences.
For the sentence feature vector of each sentence in the given sentence set, the vector can be converted into a fixed-dimension 0/1 digital semantic signature according to the sign of each dimension value, e.g., a dimension value greater than or equal to 0 is converted into 1 and a dimension value less than 0 into 0. For example, the sentence feature vector (3, 5, -4, -2, 1, 6) yields the semantic signature (1, 1, 0, 0, 1, 1). An inverted index from the semantic signatures to the corresponding sentences is then constructed.
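As a sketch, the binarization reproducing the worked example above:

```python
import numpy as np

def semantic_signature(feature_vec: np.ndarray) -> tuple[int, ...]:
    """Map each dimension to 1 if >= 0, else to 0."""
    return tuple(int(v >= 0) for v in feature_vec)

# the worked example from the text:
assert semantic_signature(np.array([3, 5, -4, -2, 1, 6])) == (1, 1, 0, 0, 1, 1)
```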
When constructing the semantic signature index, the semantic signature can be segmented following the segmentation standard of the sentence feature vector, and an inverted index built from each semantic signature segment to the corresponding sentences. This enlarges the sentence set indexed by the same semantic signature segment and improves the chance of finding similar sentences in the subsequent search.
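A sketch of this segmented inverted index, keyed by (segment position, segment value); the mapping from sentence to signature is assumed computed beforehand:

```python
from collections import defaultdict

def build_inverted_index(signatures: dict[str, tuple[int, ...]], B: int):
    """Index sentences by each of their B signature segments, so that
    sentences sharing any segment fall into the same bucket."""
    index = defaultdict(set)
    for sentence, sig in signatures.items():
        seg_len = len(sig) // B
        for j in range(B):
            segment = sig[j * seg_len:(j + 1) * seg_len]
            index[(j, segment)].add(sentence)
    return index
```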
S340, determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
For example, the sentences indexed by similar semantic signatures may be grouped into similar-sentence groups according to the similarity between the semantic signatures, and within each group the similar sentences corresponding to each sentence are determined.
As another example, based on the segmentation of the semantic signatures in step S330, the sentences indexed by identical signature segments may form a similar-sentence group for the corresponding segment, and within each group the similar sentences corresponding to each sentence are determined. That is, given a semantic signature segment and the sentence set it indexes, for each sentence in that set, a corresponding similar sentence is selected from the remaining sentences in the set.
Further, when the similar sentences of each sentence are selected within the sentence set indexed by a semantic signature, the cosine similarity between the sentence vector of that sentence and each remaining sentence is calculated, and the corresponding similar sentences are selected from the remaining sentences according to the cosine values; for example, the Top-K remaining sentences with the largest cosine similarity (i.e., the smallest cosine distance) are selected as the similar sentences of the current sentence.
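A sketch of this Top-K selection within one index bucket, assuming the sentence vectors are available as numpy arrays:

```python
import numpy as np

def top_k_similar(query: str, bucket: list[str],
                  sent_vecs: dict[str, np.ndarray], k: int = 5):
    """Rank the other sentences of a bucket by cosine similarity of
    their sentence vectors and keep the Top-K most similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    q = sent_vecs[query]
    scored = [(s, cos(q, sent_vecs[s])) for s in bucket if s != query]
    scored.sort(key=lambda t: t[1], reverse=True)  # most similar first
    return scored[:k]
```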
With the text similarity recognition method provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. Because the semantic signatures corresponding to semantically similar sentences are themselves similar, determining the similar sentences corresponding to each sentence in the sentence set according to the inverted indexes of the semantic signatures effectively guarantees the similarity between the determined similar sentences.
Further, when obtaining the sentence vectors corresponding to the sentences in the given sentence set, the scheme adopts a sentence vectorization model combining an LSTM with an attention model, and introduces a global context vector and domain context vectors into the attention model so that sentence vectors generated from semantically similar sentences are as similar as possible; this helps keep semantic similarity consistent with sentence-feature-vector similarity when the sentence feature vectors are formed later.
Example two
Fig. 5 is a structural diagram of a text similarity recognition apparatus according to an embodiment of the present invention, which can control execution of the method steps shown in fig. 3; the apparatus includes:
a sentence vector obtaining module 510, configured to obtain a sentence vector corresponding to each sentence in a given sentence set;
a sentence vector encoding module 520, configured to encode each sentence vector with a feature encoder obtained through pre-learning and generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module 530, configured to generate a semantic signature of a corresponding sentence based on positive and negative features of each dimension value in the sentence feature vector, and construct an inverted index from the semantic signature to the corresponding sentence;
and a similar sentence determining module 540, configured to determine a similar sentence corresponding to each sentence in the sentence set according to the inverted index of each semantic signature.
Further, the signature index constructing module 530 may be configured to segment the semantic signatures according to the segmentation standard of the sentence feature vector, and construct an inverted index from each semantic signature segment to the corresponding sentences.
Further, the similar sentence determining module 540 may be configured to select, for each sentence in the sentence set, a corresponding similar sentence from the remaining sentences in the sentence set according to the given semantic signature segment and the sentence set indexed by the semantic signature segment.
Further, the similar sentence determining module 540 may be configured to calculate, for each sentence in the sentence set, the cosine similarity between the sentence vector of that sentence and each remaining sentence, and select the corresponding similar sentences from the remaining sentences according to the cosine values.
Further, as shown in fig. 6, in the text similarity recognition apparatus shown in fig. 5, the sentence vector obtaining module 510 may include:
a word vector obtaining unit 610, configured to perform LSTM processing on each sentence in the given sentence set to obtain a word vector matrix corresponding to the sentence;
a sentence vector obtaining unit 620, configured to process the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
Further, the sentence vector obtaining module 510 may further include:
a model algorithm building unit 630, configured to determine whether the attention model includes a domain context vector corresponding to a domain to which a current sentence to be processed belongs;
if yes, selecting a corresponding domain context vector to form a model algorithm;
and if not, selecting a global context vector to form a model algorithm.
Further, the feature encoder may be a hash encoder.
With the text similarity recognition apparatus provided by the invention, the sentence vectors of the sentences in a given sentence set are feature-encoded so that the sentence feature vectors generated from semantically similar sentence vectors are similar on each corresponding vector segment, while those generated from dissimilar sentence vectors are not. When the semantic signature of a sentence is generated from the signs of the dimension values in its sentence feature vector, the segment-wise similarity of semantically similar sentence feature vectors carries over to the semantic signatures, which are similar segment by segment and therefore similar overall. Because the semantic signatures corresponding to semantically similar sentences are themselves similar, determining the similar sentences corresponding to each sentence in the sentence set according to the inverted indexes of the semantic signatures effectively guarantees the similarity between the determined similar sentences.
Further, when obtaining the sentence vectors corresponding to the sentences in the given sentence set, the scheme adopts a sentence vectorization model combining an LSTM with an attention model, and introduces a global context vector and domain context vectors into the attention model so that sentence vectors generated from semantically similar sentences are as similar as possible; this helps keep semantic similarity consistent with sentence-feature-vector similarity when the sentence feature vectors are formed later.
EXAMPLE III
The foregoing embodiment describes an overall architecture of a text similarity recognition apparatus, where the functions of the apparatus can be implemented by an electronic device, as shown in fig. 7, which is a schematic structural diagram of the electronic device according to an embodiment of the present invention, and specifically includes: a memory 710 and a processor 720.
And a memory 710 for storing programs.
In addition to the programs described above, the memory 710 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 710 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 720, coupled to the memory 710, for executing the program in the memory 710, the program executing the steps of the text similarity recognition method described in the previous embodiment.
Further, the processor 720 may also include various modules described in the foregoing embodiments to perform the operation of text similarity recognition, and the memory 710 may be used, for example, to store data required by the modules to perform the operation and/or output data.
The above specific processing operations have been described in detail in the foregoing embodiments, and are not described again here.
Further, as shown in fig. 7, the electronic device may further include: communication component 730, power component 740, audio component 750, display 760, and other components. Only some of the components are schematically shown in fig. 7, and the electronic device is not meant to include only the components shown in fig. 7.
The communication component 730 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 730 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 730 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply assembly 740 that provides power to the various components of the electronic device. The power components 740 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 750 is configured to output and/or input audio signals. For example, the audio component 750 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 710 or transmitted via the communication component 730. In some embodiments, audio assembly 750 also includes a speaker for outputting audio signals.
Display 760 comprises a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Example four
Fig. 8 is a flowchart of a text vectorization processing method according to an embodiment of the present invention, which includes the following steps:
s810, performing word vector conversion on the sentence to be processed to obtain a word vector matrix corresponding to the sentence.
When a sentence is subjected to word vector conversion, the word vector matrix can be obtained through LSTM processing. In practical application scenarios, the word vector conversion can also be performed with other models trained, for example, by a deep learning algorithm.
S820, processing the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
the attention model comprises a global context vector and at least one domain context vector of a domain, and the model algorithm is formed based on the global context vector or the domain context vector.
For example, a context vector may be determined from the global context vector or the domain context vector according to the industry domain to which the sentence to be processed belongs, and a model algorithm in the attention model may be formed based on the context vector.
Each industry's (domain's) independent domain context vector comprises attention weight setting information for each target word in sentences of the corresponding domain; the global context vector comprises attention weight setting information for each target word in sentences of unknown domains.
In actual application scenarios, the data volumes of different industries vary greatly, and some industries have only thousands of labeled examples, so training a separate model per industry could leave the low-data industries with poorly performing models. In this scheme, one model learns data from multiple industries; by setting a domain context vector per industry plus a global context vector, the attention mechanism allows information from other industries to be transferred to the target industry.
Accordingly, the process of forming the model algorithm may specifically be: first, judge whether the attention model contains the domain context vector corresponding to the domain to which the current sentence to be processed belongs; if so, select the corresponding domain context vector to form the model algorithm; if not, select the global context vector to form the model algorithm.
Then, the word vector matrix is processed by the context-attention mechanism using the model algorithm formed from the independent domain context vector of the industry (domain) to which the sentence belongs, or from the global context vector, to obtain the final vector representation, i.e., the sentence vector.
According to the text vectorization processing method provided by the invention, introducing an attention mechanism over per-industry domain contexts and a global context allows sentences from different domains to be vectorized by the same sentence vectorization model, without training a separate model per domain; this effectively reduces the complexity of model training in sentence vectorization while improving the practicality and efficiency of text vectorization.
EXAMPLE five
Fig. 9 is a schematic structural diagram of a text vectorization processing apparatus according to an embodiment of the present invention, which can be used to execute the method steps shown in fig. 8, and includes:
a word vector obtaining module 910, configured to perform word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector obtaining module 920, configured to process the word vector matrix through a model algorithm in the attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
Further, as shown in fig. 10, the text vectorization processing apparatus shown in fig. 9 may further include:
the model algorithm building module 101 is used for judging whether the attention model contains a domain context vector corresponding to a domain to which a sentence to be processed belongs;
if yes, selecting a corresponding domain context vector to form a model algorithm;
and if not, selecting a global context vector to form the model algorithm.
The text vectorization processing apparatus provided by the invention, by introducing an attention mechanism over per-industry domain contexts and a global context, can vectorize sentences from different domains with the same sentence vectorization model without training a separate model per domain, which effectively reduces the complexity of model training in sentence vectorization while improving the practicality and efficiency of text vectorization.
EXAMPLE six
The foregoing embodiment describes an overall architecture of a text vectorization processing apparatus, where functions of the apparatus can be implemented by an electronic device, as shown in fig. 11, which is a schematic structural diagram of the electronic device according to an embodiment of the present invention, and specifically includes: a memory 111 and a processor 112.
The memory 111 stores programs.
In addition to the above-described programs, the memory 111 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 111 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 112, coupled to the memory 111, for executing the program in the memory 111, the program executing the operation steps of the text vectorization processing method described in the foregoing embodiment.
Further, the processor 112 may also include various modules described in the foregoing embodiments to perform operations of the text vectorization processing, and the memory 111 may be used, for example, to store data required for the modules to perform the operations and/or output data.
The above specific processing operations have been described in detail in the foregoing embodiments, and are not described again here.
Further, as shown in fig. 11, the electronic device may further include: communication components 113, power components 114, audio components 115, display 116, and other components. Only some of the components are schematically shown in fig. 11, and it is not meant that the electronic device includes only the components shown in fig. 11.
The communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 113 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 113 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 114 that provides power to the various components of the electronic device. The power components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for an electronic device.
Audio component 115 is configured to output and/or input audio signals. For example, audio component 115 may include a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 111 or transmitted via the communication component 113. In some embodiments, audio component 115 also includes a speaker for outputting audio signals.
The display 116 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A text similarity recognition method comprises the following steps:
obtaining sentence vectors corresponding to sentences in a given sentence set;
encoding each sentence vector with a feature encoder obtained by learning in advance to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors, and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and determining similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
2. The method of claim 1, wherein the generating semantic signatures of the corresponding sentences based on the sign of each dimension value in the sentence feature vector and constructing an inverted index from the semantic signatures to the corresponding sentences comprises:
segmenting the semantic signatures according to the segmentation standard of the sentence feature vector, and constructing an inverted index from each semantic signature segment to the corresponding sentences.
3. The method of claim 2, wherein the determining similar sentences corresponding to each sentence in the set of sentences according to the inverted index of each semantic signature comprises:
according to the given semantic signature fragment and the sentence set indexed by the semantic signature fragment, aiming at each sentence in the sentence set, selecting a corresponding similar sentence from the rest sentences in the sentence set.
4. The method of claim 3, wherein for each sentence in the set of sentences, selecting a corresponding similar sentence from the remaining sentences in the set of sentences comprises:
calculating, for each sentence in the sentence set, the cosine similarity between the sentence vector of that sentence and each remaining sentence, and selecting the corresponding similar sentences from the remaining sentences according to the cosine values.
5. The method of claim 1, wherein the obtaining a sentence vector corresponding to each sentence in the given set of sentences comprises:
processing each sentence in the given sentence set through LSTM to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector of at least one domain, and the model algorithm is formed based on the global context vector or one of the domain context vectors.
6. The method of claim 5, wherein forming the model algorithm comprises:
judging whether the attention model contains a domain context vector corresponding to the domain to which the current sentence to be processed belongs;
if yes, selecting a corresponding field context vector to form the model algorithm;
and if not, selecting the global context vector to form the model algorithm.
7. The method of claim 1, wherein the feature encoder is a hash encoder.
8. A text similarity recognition apparatus comprising:
a sentence vector acquisition module for acquiring sentence vectors corresponding to each sentence in a given sentence set;
a sentence vector encoding module for encoding each sentence vector with a feature encoder obtained by pre-learning to generate a corresponding sentence feature vector, wherein the sentence feature vector comprises a plurality of vector segments, and the feature encoder causes sentence feature vectors generated from semantically similar sentence vectors to be similar on each corresponding vector segment, while those generated from semantically dissimilar sentence vectors are not similar;
a signature index construction module for generating semantic signatures of the corresponding sentences based on the sign (positive or negative) of each dimension value in the sentence feature vectors and constructing inverted indexes from the semantic signatures to the corresponding sentences;
and the similar sentence determining module is used for determining the similar sentences corresponding to the sentences in the sentence set according to the inverted indexes of the semantic signatures.
9. A text vectorization processing method, comprising:
performing word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
processing the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector for at least one domain, and the model algorithm is formed based on either the global context vector or one of the domain context vectors.
10. The method of claim 9, wherein forming the model algorithm comprises:
judging whether the attention model contains a domain context vector corresponding to the domain to which the sentence to be processed belongs;
if so, selecting the corresponding domain context vector to form the model algorithm;
and if not, selecting the global context vector to form the model algorithm.
11. A text vectorization processing apparatus, comprising:
a word vector acquisition module, configured to perform word vector conversion on a sentence to be processed to obtain a word vector matrix corresponding to the sentence;
a sentence vector acquisition module, configured to process the word vector matrix through a model algorithm in an attention model to obtain a sentence vector corresponding to the sentence;
wherein the attention model comprises a global context vector and a domain context vector for at least one domain, and the model algorithm is formed based on either the global context vector or one of the domain context vectors.
12. An electronic device, comprising:
a memory for storing a program;
a processor, coupled to the memory, for executing the program, wherein the program, when run, performs the text similarity recognition method of any one of claims 1 to 7.
13. An electronic device, comprising:
a memory for storing a program;
a processor, coupled to the memory, for executing the program, wherein the program, when run, performs the text vectorization processing method of claim 9 or claim 10.
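
The signature-and-index scheme of claims 1-3 can be pictured with a minimal Python sketch. The one-bit-per-dimension signature follows the claims (a bit is set when the dimension value is positive); the equal-width four-segment split and the names used here are illustrative assumptions, since the claims leave the segmentation scheme open.

```python
# Minimal sketch of claims 1-3: a binary semantic signature taken from the
# signs of a sentence feature vector, split into segments, with an inverted
# index from each signature segment to the sentences that share it.
from collections import defaultdict

import numpy as np

NUM_SEGMENTS = 4  # assumed segmentation scheme; the claims leave this open


def semantic_signature(feature_vec: np.ndarray) -> str:
    # One bit per dimension: "1" if the value is positive, "0" otherwise.
    return "".join("1" if v > 0 else "0" for v in feature_vec)


def signature_segments(signature: str, num_segments: int = NUM_SEGMENTS):
    # Split the signature into equal-width segments, mirroring the assumed
    # segmentation of the sentence feature vector.
    width = len(signature) // num_segments
    return [signature[i * width:(i + 1) * width] for i in range(num_segments)]


def build_inverted_index(feature_vectors: np.ndarray):
    # Map (segment position, segment bits) -> ids of sentences sharing them.
    index = defaultdict(set)
    for sent_id, vec in enumerate(feature_vectors):
        for pos, seg in enumerate(signature_segments(semantic_signature(vec))):
            index[(pos, seg)].add(sent_id)
    return index


def candidate_similar(sent_id: int, feature_vectors: np.ndarray, index) -> set:
    # Sentences colliding with sent_id on at least one signature segment are
    # the candidate similar sentences retrieved via the inverted index.
    candidates = set()
    sig = semantic_signature(feature_vectors[sent_id])
    for pos, seg in enumerate(signature_segments(sig)):
        candidates |= index[(pos, seg)]
    candidates.discard(sent_id)
    return candidates
```

Because the feature encoder is trained so that similar sentences agree segment by segment, two similar sentences should collide on at least one signature segment, letting the index retrieve candidates without comparing every sentence pair.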
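
Claim 4 then refines those candidates by cosine similarity over the original sentence vectors. A sketch under the assumption of a top-k selection policy follows; the claim itself only says that similar sentences are selected according to the cosine value.

```python
# Minimal sketch of claim 4: rank candidate sentences by the cosine
# similarity of the original sentence vectors and keep the best matches.
import numpy as np


def select_similar(sent_id, sentence_vectors, candidates, top_k=5):
    # top_k is an assumed cutoff; a threshold on the cosine value would
    # also fit the claim's wording.
    query = sentence_vectors[sent_id]
    query = query / np.linalg.norm(query)
    scored = []
    for cand_id in candidates:
        vec = sentence_vectors[cand_id]
        cos = float(query @ (vec / np.linalg.norm(vec)))
        scored.append((cos, cand_id))
    scored.sort(reverse=True)
    return [cand_id for _, cand_id in scored[:top_k]]
```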
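
Claims 5-6 (and, symmetrically, claims 9-10) pool a word vector matrix into a sentence vector with an attention model that holds one global context vector plus optional per-domain context vectors. The softmax attention form below, and the random initialization standing in for learned parameters, are assumptions of this sketch; the claims only state that the model algorithm is formed from whichever context vector is selected.

```python
# Minimal sketch of claims 5-6 / 9-10: attention pooling with a fallback
# from a domain context vector to the global context vector.
import numpy as np


class AttentionPooler:
    def __init__(self, dim: int, domains=()):
        rng = np.random.default_rng(0)
        # Stand-ins for parameters that would be learned in practice.
        self.global_ctx = rng.normal(size=dim)
        self.domain_ctx = {d: rng.normal(size=dim) for d in domains}

    def _context(self, domain):
        # Use the domain context vector when the model contains one for the
        # sentence's domain; otherwise fall back to the global vector.
        return self.domain_ctx.get(domain, self.global_ctx)

    def sentence_vector(self, word_matrix: np.ndarray, domain=None):
        ctx = self._context(domain)
        scores = word_matrix @ ctx                 # one score per word
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()          # softmax attention weights
        return weights @ word_matrix               # weighted sum of word rows
```

In a trained system the context vectors, like the LSTM that produces the word vector matrix, would be learned; here the word vector matrix is simply taken as input.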
CN201910425307.9A 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment Pending CN112069790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910425307.9A CN112069790A (en) 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112069790A (en) 2020-12-11

Family

ID=73657843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910425307.9A Pending CN112069790A (en) 2019-05-21 2019-05-21 Text similarity recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112069790A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170322654A1 (en) * 2012-08-24 2017-11-09 Ricky Steven Pionkowski Touchphrase interface environment
CN106897263A (en) * 2016-12-29 2017-06-27 北京光年无限科技有限公司 Robot dialogue exchange method and device based on deep learning
US20190057075A1 (en) * 2017-08-17 2019-02-21 International Business Machines Corporation Domain-specific lexically-driven pre-parser
CN108717407A (en) * 2018-05-11 2018-10-30 北京三快在线科技有限公司 Entity vector determines method and device, information retrieval method and device
CN108804417A (en) * 2018-05-21 2018-11-13 山东科技大学 A kind of documentation level sentiment analysis method based on specific area emotion word

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"聚焦提升企业竞争力的途径", 汽车纵横, no. 12 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065011A (en) * 2021-03-17 2021-07-02 北京沃东天骏信息技术有限公司 Picture determination method and device
CN113065011B (en) * 2021-03-17 2024-01-16 北京沃东天骏信息技术有限公司 Picture determination method and device

Similar Documents

Publication Publication Date Title
CN109299273B (en) Multi-source multi-label text classification method and system based on improved seq2seq model
CN111291566B (en) Event main body recognition method, device and storage medium
JP2022172381A (en) Text extraction method, text extraction model training method, device and equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112825114A (en) Semantic recognition method and device, electronic equipment and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN113051384B (en) User portrait extraction method based on dialogue and related device
CN112069790A (en) Text similarity recognition method and device and electronic equipment
CN112270184A (en) Natural language processing method, device and storage medium
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
Qi et al. Video captioning via a symmetric bidirectional decoder
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN115408494A (en) Text matching method integrating multi-head attention alignment
CN113255829A (en) Zero sample image target detection method and device based on deep learning
CN114492410A (en) Contract information extraction method and device
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN112559750A (en) Text data classification method and device, nonvolatile storage medium and processor
CN114722817A (en) Event processing method and device
CN113792120A (en) Graph network construction method and device and reading understanding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination