CN112749539A - Text matching method and device, computer readable storage medium and computer equipment - Google Patents

Text matching method and device, computer readable storage medium and computer equipment

Info

Publication number
CN112749539A
Authority
CN
China
Prior art keywords
text
bidirectional
vector sequence
word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010067253.6A
Other languages
Chinese (zh)
Other versions
CN112749539B (en)
Inventor
梁涛
李振阳
张晗
李超
马连洋
衡阵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010067253.6A priority Critical patent/CN112749539B/en
Publication of CN112749539A publication Critical patent/CN112749539A/en
Application granted granted Critical
Publication of CN112749539B publication Critical patent/CN112749539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/044: Recurrent networks, e.g. Hopfield networks
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text matching method, a text matching device, a computer readable storage medium and a computer device, wherein the method comprises the following steps: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features in the bidirectional associated information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text. By adopting the method, the accuracy of text matching can be effectively improved.

Description

Text matching method and device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of computer information processing technologies, and in particular, to a text matching method, an apparatus, a computer-readable storage medium, and a computer device.
Background
With the rapid development of information technology, information processing technology has penetrated many aspects of daily life. For example, text matching technology has been widely applied in media content recommendation scenarios: text association relationships are established by matching text information, so that in an actual media content recommendation scenario, associated content can be provided to a user based on the pre-established text associations.
However, most existing text matching technologies stop at mining word-level information associations: they only mine the similarity associations of short texts and ignore the inherent association information behind the similarity relationship, which inevitably lowers the accuracy of subsequent text matching operations.
Therefore, the text matching method in the prior art has the problem of low text matching accuracy.
Disclosure of Invention
Therefore, it is necessary to provide a text matching method, a text matching apparatus, a computer-readable storage medium, and a computer device to address the problem of low text matching accuracy in the prior art.
In one aspect, an embodiment of the present invention provides a text matching method, including: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features in the bidirectional associated information coding vector matrix, and generating text matching degree identification according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
In another aspect, an embodiment of the present invention provides a text matching apparatus, including: the word vector sequence acquisition module is used for acquiring a first word vector sequence of the first text and acquiring a second word vector sequence of the second text; the similarity matrix obtaining module is used for respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; the vector matrix construction module is used for acquiring a row vector sequence and a column vector sequence in the similarity matrix and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; the matching degree identification generation module is used for extracting text matching characteristics in the bidirectional associated information coding vector matrix and generating text matching degree identifications according to the text matching characteristics; the text matching degree identification is used for marking the matching degree between the first text and the second text.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features in the bidirectional associated information coding vector matrix, and generating text matching degree identification according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
In another aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the following steps when executing the computer program: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features in the bidirectional associated information coding vector matrix, and generating text matching degree identification according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
According to the text matching method, the text matching device, the computer readable storage medium and the computer equipment, the server can calculate word vector similarity by obtaining the first word vector sequence of the first text and the second word vector sequence of the second text, so that a similarity matrix is obtained, and then a bidirectional associated information coding vector matrix is constructed by utilizing the row vector sequence and the column vector sequence in the similarity matrix and corresponding bidirectional associated information, so that text matching characteristics are extracted from the bidirectional associated information coding vector matrix, and text matching degree identification used for marking the matching degree between the first text and the second text is generated according to the text matching characteristics. By adopting the method, the similarity characteristic of the surface layer between the two texts is used, and the deep association information of the two texts is obtained by further mining the similarity characteristic, so that the accuracy of text matching is effectively improved.
Drawings
FIG. 1 is a diagram of an application environment of a text matching method in one embodiment;
FIG. 2 is a block diagram of a computer device in one embodiment;
FIG. 3 is a flow diagram that illustrates a method for text matching, according to one embodiment;
FIG. 4 is a flowchart illustrating a word vector sequence obtaining step according to an embodiment;
FIG. 5 is a flowchart illustrating a similarity matrix obtaining step according to an embodiment;
FIG. 6 is a flowchart illustrating a vector matrix obtaining step according to an embodiment;
FIG. 7 is a flowchart illustrating a bidirectional association information obtaining step in one embodiment;
FIG. 8 is a flowchart illustrating a vector matrix obtaining step in another embodiment;
FIG. 9 is a flowchart showing a text matching degree flag generating step in one embodiment;
FIG. 10 is a flowchart illustrating a text matching method in accordance with an exemplary embodiment;
fig. 11 is a block diagram showing a configuration of a text matching apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that the term "first/second" in the embodiments of the present invention is only used to distinguish similar objects and does not denote a particular ordering; where permitted, "first" and "second" may exchange their specific order or sequence. It should be understood that objects distinguished by "first/second" may be interchanged where appropriate, so that the embodiments of the invention described herein can be practiced in sequences other than those illustrated or described herein.
FIG. 1 is a diagram of an application environment of the text matching method in one embodiment. Referring to fig. 1, the text matching method may be applied to a media content recommendation system. The media content recommendation system includes a terminal 110 and a server 120, which are connected via a network. Specifically, the server 120 may be implemented by an independent server or a server cluster composed of a plurality of servers, the terminal 110 may specifically be a desktop terminal or a mobile terminal, the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like, and the network includes but is not limited to: a wide area network, a metropolitan area network, or a local area network.
The media content recommendation system may be a media content similarity query and recommendation tool based on mass data mining. It helps users quickly filter out the information they are interested in under conditions of information overload, thereby providing personalized decision support and information services. A media content recommendation system is a system that recommends media content to users; for example, an article recommendation system recommends articles to users and may be implemented by an article reading platform such as an application program (for example, Tencent news). In practice, however, before media content can be recommended with such a system, similarity mining needs to be performed on mass data, that is, the association relationship between two pieces of information must be mined and established in advance, for example similarity mining (also referred to as text matching) between text contents.
FIG. 2 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 2, the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store text analysis data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text matching method.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
As shown in FIG. 3, in one embodiment, a text matching method is provided. The embodiment is mainly illustrated by applying the method to the server 120 in fig. 1. Referring to fig. 3, the text matching method specifically includes the following steps:
s302, a first word vector sequence of the first text is obtained, and a second word vector sequence of the second text is obtained.
The first text and the second text may be the text contents whose similarity matching relationship is currently to be mined, and may also be texts that have been pre-processed into a specified form, such as long texts or short texts.
The first word vector sequence and the second word vector sequence may be word vector sequences generated by segmenting and vectorizing the first text and the second text. For example, the first text is segmented to obtain at least two text word segments, and at least two text word vectors, that is, the first word vector sequence, can be obtained by performing vector conversion on these word segments.
Specifically, before performing text matching, the server 120 may first receive the first text and the second text sent by the terminal 110, then perform word segmentation and vectorization on them, converting the first text and the second text into the corresponding first word vector sequence and second word vector sequence, and complete the subsequent vector matching operations based on these two sequences.
S304, respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix.
The word vector similarity may refer to the degree of similarity of word vector features, such as word sense, part of speech and word frequency, between the first word vector sequence and the second word vector sequence. Its value range may be expressed as a numerical range such as 0-1 or 0-10, or as a percentage range such as 0-100%.
Specifically, since the matching calculation is performed between the first text and the second text, the word vectors in the first word vector sequence and the second word vector sequence need to be compared one by one. That is, after the server 120 obtains the first text and the second text and performs word segmentation and vectorization on them to obtain one or more first word vectors and second word vectors, the cosine similarity between each first word vector and each second word vector is calculated one by one, and the resulting groups of similarity values are used to construct the similarity matrix.
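As a minimal sketch of this per-word-vector cosine-similarity computation (NumPy, with illustrative dimensions; the patent does not prescribe a particular library or vector size):

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def build_similarity_matrix(first_vectors, second_vectors):
    """Entry (i, j) holds the similarity of first word vector i and second word vector j."""
    return np.array([[cosine_similarity(u, v) for v in second_vectors]
                     for u in first_vectors])

# Toy example: 3 words in the first text, 4 words in the second, 5-dimensional embeddings.
rng = np.random.default_rng(0)
sim = build_similarity_matrix(rng.normal(size=(3, 5)), rng.normal(size=(4, 5)))
print(sim.shape)  # (3, 4)
```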
S306, acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence.
The row vector sequence may be a row vector sequence formed by dividing the similarity matrix in the row direction and including a plurality of vectors.
The column vector sequence may be a column vector sequence that is formed by dividing the similarity matrix in the column direction and includes a plurality of vectors.
The bidirectional associated information may be information generated by performing bidirectional associated-information mining on the row vector sequence and the column vector sequence, where "bidirectional" refers to the two directions of the associated query: from the first text to the second text and from the second text to the first text.
Specifically, after obtaining the similarity matrix, the server 120 may first determine the row direction and the column direction of the similarity matrix and segment it along each direction to obtain a row vector sequence and a column vector sequence. The two sequences are then input together into a bidirectional associated-information mining network, such as a bidirectional Long Short-Term Memory network (Bi-LSTM), which captures bidirectional semantic dependencies between the two texts, and the mined bidirectional associated information is used to construct the bidirectional associated information coding vector matrix for subsequent matrix feature extraction and matching.
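A minimal sketch of the row/column split and Bi-LSTM encoding, assuming both texts are padded to fixed lengths and using PyTorch with illustrative layer sizes (the framework and dimensions are assumptions, not specified by the patent):

```python
import torch
import torch.nn as nn

n, k, hidden = 12, 10, 32  # assumed padded lengths of the two texts and an illustrative hidden size

# One Bi-LSTM per direction of the similarity matrix: rows are k-dimensional, columns n-dimensional.
row_lstm = nn.LSTM(input_size=k, hidden_size=hidden, batch_first=True, bidirectional=True)
col_lstm = nn.LSTM(input_size=n, hidden_size=hidden, batch_first=True, bidirectional=True)

sim = torch.randn(1, n, k)          # a batch holding one similarity matrix
row_seq = sim                        # row vector sequence: n vectors of dimension k
col_seq = sim.transpose(1, 2)        # column vector sequence: k vectors of dimension n

row_code, _ = row_lstm(row_seq)      # (1, n, 2*hidden): forward + backward states per row
col_code, _ = col_lstm(col_seq)      # (1, k, 2*hidden): forward + backward states per column
```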
S308, extracting text matching features in the bidirectional associated information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
The text matching feature may be a text feature to be extracted that is determined in advance through learning.
The text matching degree identifier may be an identifier that records a matching relationship between the first text and the second text, for example, a text matching identifier and a text non-matching identifier.
Specifically, after the server 120 constructs the bidirectional associated information coding vector matrix based on the bidirectional associated information of the row vector sequence and the column vector sequence, the text matching features in this matrix can be further extracted. The features may be extracted with a machine learning model or a deep learning model; for example, the bidirectional associated information coding vector matrix is input into a convolutional neural network model for text matching feature extraction. The text matching degree identifier is then generated from the text matching features; one way to generate it is to calculate the similarity of the text matching features and use the calculation result as the basis for the identifier.
In this embodiment, the server may calculate word vector similarity by obtaining a first word vector sequence of the first text and a second word vector sequence of the second text, so as to obtain a similarity matrix, and further construct a bidirectional associated information encoding vector matrix by using a row vector sequence and a column vector sequence in the similarity matrix and bidirectional associated information corresponding to the two vector sequences, so as to extract text matching features from the bidirectional associated information encoding vector matrix, and generate a text matching degree identifier for marking a matching degree between the first text and the second text according to the text matching features. By adopting the method, the similarity characteristic of the surface layer between the two texts is used, and the deep association information of the two texts is obtained by further mining the similarity characteristic, so that the accuracy of text matching is effectively improved.
As shown in fig. 4, in an embodiment, the obtaining a first word vector sequence of the first text and obtaining a second word vector sequence of the second text in step S302 specifically includes the following steps:
s3022, acquiring the first text and the second text.
Specifically, before the server 120 performs the text matching task, a first text and a second text are first acquired, and the first text and the second text may be determined to be sent by the user through the terminal 110, that is, the first text and the second text submitted by the user are sent to the server 120 by using the network connection between the terminal 110 and the server 120.
And S3024, performing word segmentation on the first text and the second text respectively to obtain a first word sequence of the first text and a second word sequence of the second text.
The first word sequence and the second word sequence can be word sequences obtained by performing word segmentation processing on the first text and the second text respectively.
Specifically, after obtaining the first text and the second text, the server 120 may perform word segmentation processing on the first text and the second text through a preset word segmentation algorithm, so as to obtain a first word sequence of the first text after word segmentation and a second word sequence of the second text after word segmentation.
More specifically, a word segmentation algorithm based on character string matching (a mechanical word segmentation algorithm) matches the character string to be segmented against entries in a pre-established dictionary; if the match succeeds, the word is segmented out. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching, and according to the length-matching priority, it can be divided into maximum matching and minimum matching. A comprehension-based word segmentation algorithm performs syntactic and semantic analysis during segmentation and uses syntactic and semantic information to resolve segmentation ambiguities. A statistics-based word segmentation algorithm measures the probability of a word combination by the co-occurrence frequency of adjacent characters; when the frequency exceeds a preset threshold, the adjacent characters are determined to constitute a word.
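As an illustration of the dictionary-based (mechanical) approach, a forward maximum matching segmenter is sketched below; the toy dictionary and input are examples, not the patent's word segmentation algorithm:

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedily take the longest dictionary word starting at each position (forward maximum matching)."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character when no entry matches
                words.append(piece)
                i += size
                break
    return words

vocab = {"文本匹配", "文本", "匹配", "方法"}
print(forward_max_match("文本匹配方法", vocab))  # ['文本匹配', '方法']
```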
S3026, according to the pre-stored data vector mapping relationship, determining a mapping vector of the first word sequence as a first word vector sequence, and determining a mapping vector of the second word sequence as a second word vector sequence.
The pre-stored data vector mapping relationship may be a data mapping mechanism, that is, a mechanism that maps words to their corresponding real-valued vectors.
Specifically, through the pre-stored data vector mapping relationship, the server 120 may determine the real-valued vector mapped to the first word sequence as the first word vector sequence and the real-valued vector mapped to the second word sequence as the second word vector sequence. The mapping relationship may be one-to-one, so that the determined first word vector sequence and second word vector sequence are uniquely determined.
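A minimal sketch of this word-to-vector mapping, assuming the pre-stored relationship is an embedding lookup table; the table entries, dimension, and out-of-vocabulary handling below are stand-ins:

```python
import numpy as np

dim = 8
embedding_table = {               # pre-stored data vector mapping (random stand-in values)
    "文本": np.random.rand(dim),
    "匹配": np.random.rand(dim),
    "方法": np.random.rand(dim),
}
unk = np.zeros(dim)               # assumed fallback for words outside the mapping

def to_word_vector_sequence(word_sequence):
    """Map each word of a word sequence to its vector, yielding a word vector sequence."""
    return np.stack([embedding_table.get(w, unk) for w in word_sequence])

first_word_vector_sequence = to_word_vector_sequence(["文本", "匹配", "方法"])
print(first_word_vector_sequence.shape)  # (3, 8)
```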
In the embodiment, the first text and the second text are subjected to word segmentation quantization, and the similarity is calculated by using the quantized word vector sequence, so that the accuracy of text matching can be effectively improved.
As shown in fig. 5, in an embodiment, the step S304 of respectively calculating the word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix includes the following steps:
s3042, determining first word vectors of at least two of the first word vector sequences, and determining second word vectors of at least two of the second word vector sequences.
The first word vector and the second word vector may be the individual word vectors contained in the first word vector sequence and the second word vector sequence, respectively.
Specifically, before calculating the word vector similarity, the server 120 first determines which word vectors in the first word vector sequence and the second word vector sequence are currently to be matched, and then calculates the similarity of each first word vector and each second word vector one by one; the obtained similarity results can then be used to construct the similarity matrix.
S3044, multiplying the at least two first word vectors by the at least two second word vectors to obtain at least two word vector similarities.
Specifically, the server 120 may calculate the similarity between the first word vector and the second word vector by multiplying the first word vector and the second word vector one by one.
S3046, constructing a matrix according to the similarity of at least two word vectors to obtain a similarity matrix.
Specifically, the server 120 may use word vector similarity obtained by multiplying the word vectors to construct a similarity matrix, and the ordering manner of the matrix corresponds to the multiplication calculation order of the first word vector and the second word vector.
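Since every first word vector is multiplied with every second word vector, the whole similarity matrix can be formed with a single matrix product; a short NumPy sketch with illustrative shapes follows:

```python
import numpy as np

def word_vector_similarity_matrix(first_seq, second_seq):
    """Entry (i, j) is the product of first word vector i and second word vector j (a dot product)."""
    return first_seq @ second_seq.T

first_seq = np.random.rand(3, 8)   # first word vector sequence: 3 word vectors of dimension 8
second_seq = np.random.rand(5, 8)  # second word vector sequence: 5 word vectors
sim_matrix = word_vector_similarity_matrix(first_seq, second_seq)
print(sim_matrix.shape)            # (3, 5): ordering follows the multiplication order of the word vectors
```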
In the embodiment, the similarity matrix is constructed by respectively calculating the similarity between the vectors, so that the accuracy of text matching can be effectively improved.
As shown in fig. 6, in an embodiment, the step S306 acquires a row vector sequence and a column vector sequence in the similarity matrix, and constructs a bidirectional association information encoding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence, which specifically includes the following steps:
s3062, performing vector row-column division on the similarity matrix to obtain a row vector sequence and a column vector sequence.
Specifically, the server 120 may perform vector row-column division on the similarity matrix by extracting vectors along its row and column directions, thereby obtaining a row vector sequence extracted along the row direction of the similarity matrix and a column vector sequence extracted along its column direction.
S3064, acquiring the bidirectional association information of the row vector sequence and the column vector sequence through the bidirectional long-term and short-term memory network.
The bidirectional long short-term memory network is a type of RNN (Recurrent Neural Network). It is suitable for modeling sequential data such as text and can be used to model context information in natural language processing tasks.
Specifically, the server 120 may obtain the bidirectional association information of the row vector sequence and the column vector sequence through the bidirectional long-short term memory network, that is, the row vector sequence and the column vector sequence are jointly input to the bidirectional long-short term memory network, and the bidirectional association information is mined, so as to obtain the bidirectional association information corresponding to the row vector sequence and the column vector sequence, respectively.
S3066, coding the bidirectional associated information, and acquiring a bidirectional associated information coding vector matrix according to the coded bidirectional associated information.
Specifically, to encode the bidirectional associated information, the server 120 may control the encoding operation performed by the bidirectional long short-term memory network to obtain bidirectional associated information codes, and these codes can be used to construct the associated-information coding vector matrix for text matching feature extraction.
In this embodiment, the surface-level similarity features between the two texts are used, and these similarity features are further mined in depth through the bidirectional long short-term memory network, so that the deep association information of the two texts is obtained and the accuracy of text matching is ultimately improved.
As shown in fig. 7, in an embodiment, the step S3064 of obtaining the bidirectional association information of the row vector sequence and the column vector sequence through the bidirectional long-short term memory network specifically includes the following steps:
s30642, inputting the row vector sequence and the column vector sequence into the bidirectional long-short term memory network respectively;
s30644, acquiring first bidirectional associated information and second bidirectional associated information output by the bidirectional long-short term memory network as bidirectional associated information; the first bidirectional associated information and the second bidirectional associated information are information obtained by the bidirectional long-short term memory network respectively mining the bidirectional associated information according to the row vector sequence and the column vector sequence.
Specifically, the server 120 may obtain the first bidirectional association information and the second bidirectional association information output by the bidirectional long-short term memory network by inputting the row vector sequence and the column vector sequence into the bidirectional long-short term memory network, respectively, so as to obtain the bidirectional association information of the row vector sequence and the column vector sequence.
In the embodiment, the bidirectional associated information of the row vector sequence and the column vector sequence is acquired through the bidirectional long-short term memory network, so that the accuracy of text matching can be effectively improved.
As shown in fig. 8, in an embodiment, the bidirectional associated information includes first bidirectional associated information and second bidirectional associated information, the step S3066 encodes the bidirectional associated information, and obtains a bidirectional associated information encoding vector matrix according to the encoded bidirectional associated information, which specifically includes the following steps:
s30662, encoding the first bidirectional associated information and the second bidirectional associated information respectively to obtain a first bidirectional associated information code and a second bidirectional associated information code;
s30664, combining the first bidirectional associated information code with the row vector sequence to obtain a first information code vector matrix, and combining the second bidirectional associated information code with the column vector sequence to obtain a second information code vector matrix;
s30666, the first information encoding vector matrix and the second information encoding vector matrix are determined as the bidirectional associated information encoding vector matrix.
Specifically, the server 120 may control the bidirectional long-short term memory network to encode the first bidirectional correlation information and the second bidirectional correlation information, so as to obtain a first bidirectional correlation information code and a second bidirectional correlation information code, further combine the first bidirectional correlation information code with the row vector sequence to obtain a first information code vector matrix, combine the second bidirectional correlation information code with the column vector sequence to obtain a second information code vector matrix, and finally obtain a bidirectional correlation information code vector matrix corresponding to the row vector sequence or the column vector sequence.
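A minimal sketch of steps S30662-S30666, assuming "combining" means concatenating each bidirectional associated information code with the corresponding row or column vector sequence along the feature dimension (the shapes continue the earlier Bi-LSTM sketch and are illustrative):

```python
import torch

n, k, hidden = 12, 10, 32
row_seq = torch.randn(1, n, k)            # row vector sequence of the similarity matrix
col_seq = torch.randn(1, k, n)            # column vector sequence
row_code = torch.randn(1, n, 2 * hidden)  # first bidirectional associated information code (Bi-LSTM output)
col_code = torch.randn(1, k, 2 * hidden)  # second bidirectional associated information code

first_code_matrix = torch.cat([row_seq, row_code], dim=-1)   # first information coding vector matrix, (1, n, k + 2*hidden)
second_code_matrix = torch.cat([col_seq, col_code], dim=-1)  # second information coding vector matrix, (1, k, n + 2*hidden)
bidirectional_code_matrices = (first_code_matrix, second_code_matrix)
```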
In the embodiment, the text matching characteristics for calculating the text similarity are acquired by constructing the bidirectional associated information coding vector matrix, so that the accuracy of text matching can be effectively improved.
As shown in fig. 9, in an embodiment, the extracting text matching features in the bidirectional associated information coding vector matrix in step S308, and generating a text matching degree identifier according to the text matching features specifically includes the following steps:
s3082, inputting the bidirectional associated information coding vector matrix into a convolutional neural network model;
s3084, acquiring feature information output by the convolutional neural network model as a text matching feature;
s3086, inputting the text matching features into a full connection layer of the convolutional neural network model;
s3088, acquiring an output result of the full connection layer to obtain a text matching degree identifier; the output result is the result of matching the full connection layer according to the text matching characteristics; the text matching degree identification comprises a text matching identification and a text non-matching identification.
A Convolutional Neural Network (CNN) is a type of feed-forward neural network that contains convolution computations and has a deep structure, and it is one of the representative algorithms of deep learning.
Specifically, the server 120 may first input the bidirectional associated information coding vector matrix into the convolutional neural network model, obtain the text matching features output by the model, and input these features into the fully connected layer of the convolutional neural network model. The fully connected layer outputs a feature matching result, from which the server 120 generates a text matching degree identifier indicating whether the first text and the second text match. It should be understood that the text matching degree identifier in this embodiment may be a text matching identifier represented by the value 1 and a text non-matching identifier represented by the value 0.
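A minimal sketch of steps S3082-S3088, with an assumed small CNN and fully connected layer (layer sizes and pooling choices are illustrative, not the patented architecture):

```python
import torch
import torch.nn as nn

class MatchClassifier(nn.Module):
    """Extracts text matching features with a CNN and scores match / non-match with a fully connected layer."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d((4, 4)),
        )
        self.fc = nn.Linear(16 * 4 * 4, 2)                # two classes: 0 = non-match, 1 = match

    def forward(self, code_matrix):                       # code_matrix: (batch, height, width)
        feats = self.conv(code_matrix.unsqueeze(1))       # text matching features
        return self.fc(feats.flatten(start_dim=1))        # output of the fully connected layer

model = MatchClassifier()
logits = model(torch.randn(2, 12, 74))                    # two bidirectional associated information coding matrices
match_flag = logits.argmax(dim=-1)                        # text matching degree identifier: 1 = match, 0 = non-match
```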
In the embodiment, the text matching characteristics are obtained through the convolutional neural network model, and the output result of the text matching is obtained through the full connection layer, so that the text matching degree identification is obtained, and the accuracy of the text matching can be effectively improved.
To facilitate a thorough understanding of the embodiments of the present application by those skilled in the art, a specific example will be described below with reference to fig. 10. FIG. 10 is a schematic diagram of the complete flow of the text matching method in this embodiment. As can be seen from fig. 10, the server 120 first obtains the first text (short text 1) and the second text (short text 2) and performs word segmentation on them to obtain a first word sequence (words 1-N) and a second word sequence (words 1-k). A data mapping mechanism is then used to obtain the first word vectors (word vectors 1-N) and the second word vectors (word vectors 1-k) corresponding to the two word sequences. The first word vectors and the second word vectors interact to produce a similarity matrix N, and the row vector sequence and the column vector sequence obtained by splitting the similarity matrix N by rows and columns are input into the bidirectional long short-term memory network to obtain the information codes it outputs. These information codes are used to construct the bidirectional associated information coding vector matrix, which is input into the convolutional neural network to obtain the matching feature information output by the network; the feature matching result of the two texts is then obtained through the fully connected layer. Finally, the server 120 uses the feature matching result to generate a text matching degree identifier, which marks the matching degree between the first text and the second text and is then applied in the media content recommendation system.
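Putting the pieces together, the sketch below mirrors the Fig. 10 flow end to end (word-id inputs, embedding, similarity matrix, Bi-LSTM encodings, CNN, fully connected layer); every dimension, layer size, and the shared CNN are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TextMatcher(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, n=12, k=10, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # word -> word vector (data mapping mechanism)
        self.row_lstm = nn.LSTM(k, hidden, batch_first=True, bidirectional=True)
        self.col_lstm = nn.LSTM(n, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveMaxPool2d((4, 4)))
        self.fc = nn.Linear(2 * 16 * 4 * 4, 2)                # fully connected layer: non-match / match

    def forward(self, first_ids, second_ids):                 # padded word-id sequences: (batch, n), (batch, k)
        a, b = self.embed(first_ids), self.embed(second_ids)  # word vector sequences
        sim = a @ b.transpose(1, 2)                           # similarity matrix N, (batch, n, k)
        row_code, _ = self.row_lstm(sim)                      # first bidirectional associated information code
        col_code, _ = self.col_lstm(sim.transpose(1, 2))      # second bidirectional associated information code
        f1 = self.conv(torch.cat([sim, row_code], -1).unsqueeze(1)).flatten(1)
        f2 = self.conv(torch.cat([sim.transpose(1, 2), col_code], -1).unsqueeze(1)).flatten(1)
        return self.fc(torch.cat([f1, f2], dim=-1))           # logits: 0 = non-match, 1 = match

matcher = TextMatcher()
logits = matcher(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 10)))
```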
In the embodiment, the similarity characteristic of the surface layer between the two texts is used, and the deep association information of the two texts is obtained by deep mining by using the similarity characteristic, so that the accuracy of text matching is effectively improved.
It should be understood that although the various steps in the flow charts of fig. 3-9 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order described and may be performed in other orders. Moreover, at least some of the steps in fig. 3-9 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 11, in an embodiment, a text matching apparatus 1100 is provided, where the apparatus 1100 may be disposed in a text matching system, and is configured to execute the text matching method, where the text matching apparatus 1100 specifically includes: a word vector sequence obtaining module 1102, a similarity matrix obtaining module 1104, a vector matrix constructing module 1106, and a matching degree identifier generating module 1108, where:
a word vector sequence obtaining module 1102, configured to obtain a first word vector sequence of the first text, and obtain a second word vector sequence of the second text;
a similarity matrix obtaining module 1104, configured to calculate word vector similarities between the first word vector sequence and the second word vector sequence, respectively, to obtain a similarity matrix;
a vector matrix constructing module 1106, configured to obtain a row vector sequence and a column vector sequence in the similarity matrix, and construct a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence;
the matching degree identifier generating module 1108 is configured to extract text matching features in the bidirectional associated information coding vector matrix, and generate a text matching degree identifier according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
In one embodiment, the word vector sequence obtaining module 1102 is further configured to obtain a first text and a second text; respectively segmenting the first text and the second text to obtain a first word sequence of the first text and a second word sequence of the second text; and determining a mapping vector of the first word sequence as a first word vector sequence and determining a mapping vector of the second word sequence as a second word vector sequence according to a pre-stored data vector mapping relation.
In one embodiment, the similarity matrix obtaining module 1104 is further configured to determine first word vectors of at least two of the first word vector sequences, and determine second word vectors of at least two of the second word vector sequences; multiplying at least two first word vectors by at least two second word vectors respectively to obtain at least two word vector similarities; and constructing a matrix according to the similarity of at least two word vectors to obtain a similarity matrix.
In one embodiment, the vector matrix constructing module 1106 is further configured to perform vector row-column division on the similarity matrix to obtain a row vector sequence and a column vector sequence; acquiring bidirectional association information of a row vector sequence and a column vector sequence through a bidirectional long-short term memory network; and coding the bidirectional associated information, and acquiring a bidirectional associated information coding vector matrix according to the coded bidirectional associated information.
In one embodiment, the vector matrix construction module 1106 is further configured to input the row vector sequence and the column vector sequence into the bidirectional long-short term memory network, respectively; acquiring first bidirectional associated information and second bidirectional associated information output by a bidirectional long-short term memory network as bidirectional associated information; the first bidirectional associated information and the second bidirectional associated information are information obtained by the bidirectional long-short term memory network respectively mining the bidirectional associated information according to the row vector sequence and the column vector sequence.
In one embodiment, the vector matrix constructing module 1106 is further configured to encode the first bidirectional association information and the second bidirectional association information respectively to obtain a first bidirectional association information code and a second bidirectional association information code; combining the first bidirectional associated information code with a row vector sequence to obtain a first information code vector matrix, and combining the second bidirectional associated information code with a column vector sequence to obtain a second information code vector matrix; and determining the first information coding vector matrix and the second information coding vector matrix as the bidirectional associated information coding vector matrix.
In one embodiment, the matching degree identifier generating module 1108 is further configured to input the bidirectional correlation information coding vector matrix to a convolutional neural network model; acquiring feature information output by the convolutional neural network model as text matching features; inputting the text matching features into a full connection layer of the convolutional neural network model; acquiring an output result of the full connection layer to obtain a text matching degree identifier; the output result is the result of matching the full connection layer according to the text matching characteristics; the text matching degree identification comprises a text matching identification and a text non-matching identification.
In this embodiment, the server may calculate word vector similarity by obtaining a first word vector sequence of the first text and a second word vector sequence of the second text, so as to obtain a similarity matrix, and further construct a bidirectional associated information encoding vector matrix by using a row vector sequence and a column vector sequence in the similarity matrix and bidirectional associated information corresponding to the two vector sequences, so as to extract text matching features from the bidirectional associated information encoding vector matrix, and generate a text matching degree identifier for marking a matching degree between the first text and the second text according to the text matching features. By adopting the scheme, the similarity characteristic of the surface layer between the two texts is used, and the deep association information of the two texts is obtained by further mining the similarity characteristic, so that the accuracy of text matching is effectively improved.
In one embodiment, the text matching apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 2. The memory of the computer device may store various program modules constituting the text matching apparatus, such as a word vector sequence obtaining module 1102, a similarity matrix obtaining module 1104, a vector matrix constructing module 1106, and a matching degree identifier generating module 1108 shown in fig. 11. The computer program constituted by the respective program modules causes the processor to execute the steps in the text matching method of the respective embodiments of the present application described in the present specification.
For example, the computer apparatus shown in fig. 2 may perform step S302 by the word vector sequence acquisition module 1102 in the text matching apparatus shown in fig. 11. The computer device may perform step S304 through the similarity matrix acquisition module 1104. The computer device may perform step S306 by the vector matrix construction module 1106. The computer device may execute step S308 by the matching degree identification generation module 1108.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the text matching method described above. Here, the steps of the text matching method may be steps in the text matching method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the above-described text matching method. Here, the steps of the text matching method may be steps in the text matching method of each of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A text matching method is characterized by comprising the following steps:
acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text;
respectively calculating word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix;
acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence;
extracting text matching features in the bidirectional associated information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identification is used for marking the matching degree between the first text and the second text.
2. The method of claim 1, wherein obtaining a first word vector sequence for a first text and obtaining a second word vector sequence for a second text comprises:
acquiring a first text and a second text;
performing word segmentation on the first text and the second text respectively to obtain a first word sequence of the first text and a second word sequence of the second text;
and determining a mapping vector of the first word sequence as the first word vector sequence and determining a mapping vector of the second word sequence as the second word vector sequence according to a pre-stored data vector mapping relation.
3. The method of claim 1, wherein the separately calculating word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix comprises:
determining a first word vector of at least two of the first sequence of word vectors and determining a second word vector of at least two of the second sequence of word vectors;
multiplying the at least two first word vectors by the at least two second word vectors respectively to obtain at least two word vector similarities;
and constructing a matrix according to the similarity of the at least two word vectors to obtain the similarity matrix.
4. The method according to claim 1, wherein the obtaining a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence comprises:
performing vector row-column division on the similarity matrix to obtain a row vector sequence and a column vector sequence;
acquiring bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short term memory network;
and coding the bidirectional associated information, and acquiring a bidirectional associated information coding vector matrix according to the coded bidirectional associated information.
5. The method of claim 4, wherein the obtaining the bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short term memory network comprises:
inputting the row vector sequence and the column vector sequence into the bidirectional long-short term memory network respectively;
acquiring first bidirectional associated information and second bidirectional associated information output by the bidirectional long-short term memory network as the bidirectional associated information; the first bidirectional associated information and the second bidirectional associated information are information obtained by the bidirectional long-short term memory network respectively mining bidirectional associated information according to the row vector sequence and the column vector sequence.
6. The method according to claim 4, wherein the bidirectional association information comprises first bidirectional association information and second bidirectional association information, and the encoding the bidirectional association information and acquiring the bidirectional association information encoding vector matrix according to the encoded bidirectional association information comprises:
respectively encoding the first bidirectional association information and the second bidirectional association information to obtain a first bidirectional association information code and a second bidirectional association information code;
combining the first bidirectional association information code with the row vector sequence to obtain a first information encoding vector matrix, and combining the second bidirectional association information code with the column vector sequence to obtain a second information encoding vector matrix;
and determining the first information encoding vector matrix and the second information encoding vector matrix as the bidirectional association information encoding vector matrix.
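A sketch of claim 6, reading "combining" as feature-wise concatenation of each encoded bidirectional association vector with the corresponding row or column vector; the concatenation and the encoding dimensionality are assumptions for illustration.

```python
import torch

m, n, enc = 5, 3, 16
row_seq = torch.randn(m, n)            # m row vectors of the similarity matrix
col_seq = torch.randn(n, m)            # n column vectors
first_code = torch.randn(m, enc)       # encoded first bidirectional association information
second_code = torch.randn(n, enc)      # encoded second bidirectional association information

first_matrix = torch.cat([row_seq, first_code], dim=1)    # (m, n + enc)
second_matrix = torch.cat([col_seq, second_code], dim=1)  # (n, m + enc)

# Together the two matrices serve as the bidirectional association information
# encoding vector matrix used for feature extraction downstream.
print(first_matrix.shape, second_matrix.shape)
```

Concatenation keeps both the raw similarities and the sequence context available to the downstream feature extractor, which is one common way to realize such a combination.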
7. The method according to claim 1, wherein the extracting text matching features from the bidirectional association information encoding vector matrix and generating a text matching degree identification according to the text matching features comprises:
inputting the bidirectional association information encoding vector matrix into a convolutional neural network model;
acquiring feature information output by the convolutional neural network model as the text matching features;
inputting the text matching features into a fully connected layer of the convolutional neural network model;
and acquiring an output result of the fully connected layer to obtain the text matching degree identification; wherein the output result is obtained by the fully connected layer performing matching according to the text matching features; and the text matching degree identification comprises a text matching identification and a text mismatching identification.
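A sketch of claim 7 in PyTorch: a small convolutional network extracts text matching features from an encoding vector matrix, and a fully connected layer maps the pooled features to a match / mismatch identification. The kernel size, channel count, pooling size and the assumed label order are illustrative; feeding a single matrix (rather than both matrices from claim 6) is also a simplification.

```python
import torch
import torch.nn as nn

H, W = 5, 19                                        # encoding vector matrix size (assumed)
matrix = torch.randn(1, 1, H, W)                    # (batch, channel, H, W)

conv = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),      # feature extraction
    nn.ReLU(),
    nn.AdaptiveMaxPool2d((2, 2)),                   # pool to a fixed size
)
fc = nn.Linear(4 * 2 * 2, 2)                        # fully connected layer

features = conv(matrix)                             # text matching features
logits = fc(features.flatten(start_dim=1))          # (1, 2)
match_id = logits.argmax(dim=1).item()              # 1 = match, 0 = mismatch (assumed)
print(match_id)
```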
8. A text matching apparatus, characterized in that the apparatus comprises:
a word vector sequence acquisition module, configured to acquire a first word vector sequence of a first text and a second word vector sequence of a second text;
a similarity matrix acquisition module, configured to respectively calculate word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix;
a vector matrix construction module, configured to acquire a row vector sequence and a column vector sequence in the similarity matrix and construct a bidirectional association information encoding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence;
and a matching degree identification generation module, configured to extract text matching features from the bidirectional association information encoding vector matrix and generate a text matching degree identification according to the text matching features; wherein the text matching degree identification is used to mark the matching degree between the first text and the second text.
9. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
CN202010067253.6A 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment Active CN112749539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010067253.6A CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010067253.6A CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN112749539A true CN112749539A (en) 2021-05-04
CN112749539B CN112749539B (en) 2023-09-15

Family

ID=75645130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010067253.6A Active CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN112749539B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776545A (en) * 2016-11-29 2017-05-31 西安交通大学 A kind of method that Similarity Measure between short text is carried out by depth convolutional neural networks
WO2019235103A1 (en) * 2018-06-07 2019-12-12 日本電信電話株式会社 Question generation device, question generation method, and program
CN108846077A (en) * 2018-06-08 2018-11-20 泰康保险集团股份有限公司 Semantic matching method, device, medium and the electronic equipment of question and answer text
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN110348014A (en) * 2019-07-10 2019-10-18 电子科技大学 A kind of semantic similarity calculation method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG PANG et al.: "Text Matching as Image Recognition", https://arxiv.org/pdf/1602.06359.pdf, pages 1-8 *
SUN YANG: "Chinese Sentence Similarity Computation Based on Convolutional Neural Networks", China Masters' Theses Full-text Database, Information Science and Technology, no. 08, pages 138-1277 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600580A (en) * 2022-11-29 2023-01-13 深圳智能思创科技有限公司(Cn) Text matching method, device, equipment and storage medium
CN115600580B (en) * 2022-11-29 2023-04-07 深圳智能思创科技有限公司 Text matching method, device, equipment and storage medium
CN117725153A (en) * 2023-09-22 2024-03-19 书行科技(北京)有限公司 Text matching method, device, electronic equipment and storage medium
CN117725153B (en) * 2023-09-22 2024-09-24 书行科技(北京)有限公司 Text matching method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112749539B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN110162669B (en) Video classification processing method and device, computer equipment and storage medium
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
CN109446302B (en) Question-answer data processing method and device based on machine learning and computer equipment
CN111581229B (en) SQL statement generation method and device, computer equipment and storage medium
CN110413730B (en) Text information matching degree detection method, device, computer equipment and storage medium
CN108563782B (en) Commodity information format processing method and device, computer equipment and storage medium
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN111061847A (en) Dialogue generation and corpus expansion method and device, computer equipment and storage medium
CN110263348B (en) Translation method, translation device, computer equipment and storage medium
CN110737818B (en) Network release data processing method, device, computer equipment and storage medium
CN111460807A (en) Sequence labeling method and device, computer equipment and storage medium
CN112766319B (en) Dialogue intention recognition model training method, device, computer equipment and medium
CN112765984A (en) Named entity recognition method and device, computer equipment and storage medium
CN113254687B (en) Image retrieval and image quantification model training method, device and storage medium
CN111859916B (en) Method, device, equipment and medium for extracting key words of ancient poems and generating poems
CN112231224A (en) Business system testing method, device, equipment and medium based on artificial intelligence
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN109460541B (en) Vocabulary relation labeling method and device, computer equipment and storage medium
CN112749539A (en) Text matching method and device, computer readable storage medium and computer equipment
CN112732884A (en) Target answer sentence generation method and device, computer equipment and storage medium
CN112733539A (en) Interview entity recognition model training and interview information entity extraction method and device
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN113342927B (en) Sensitive word recognition method, device, equipment and storage medium
CN112837673B (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40048284

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant