CN112749539B - Text matching method, text matching device, computer readable storage medium and computer equipment - Google Patents

Info

Publication number
CN112749539B
Authority
CN
China
Prior art keywords
text
association information
word
vector
vector sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010067253.6A
Other languages
Chinese (zh)
Other versions
CN112749539A (en
Inventor
梁涛
李振阳
张晗
李超
马连洋
衡阵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010067253.6A priority Critical patent/CN112749539B/en
Publication of CN112749539A publication Critical patent/CN112749539A/en
Application granted granted Critical
Publication of CN112749539B publication Critical patent/CN112749539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text matching method, a text matching device, a computer readable storage medium and computer equipment, wherein the method comprises the following steps: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; respectively calculating the word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching characteristics in the bidirectional association information coding vector matrix, and generating a text matching degree identifier according to the text matching characteristics; the text matching degree identifier is used for marking the matching degree between the first text and the second text. By adopting the method, the accuracy of text matching can be effectively improved.

Description

Text matching method, text matching device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of computer information processing technology, and in particular, to a text matching method, a text matching device, a computer readable storage medium, and a computer device.
Background
With the rapid development of information technology, information processing has become part of many aspects of daily life. Text matching, for example, is widely used in media content recommendation: associations between texts are established in advance by matching text information, so that related content can be provided to users from those pre-established associations in actual recommendation scenarios.
However, most existing text matching techniques stop at mining word-level associations. They mine only the similarity of short texts and do not consider the intrinsic association information behind that similarity, which lowers the accuracy of subsequent text matching.
Therefore, text matching methods in the prior art suffer from low text matching accuracy.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a text matching method, a device, a computer readable storage medium and a computer apparatus, aiming at the technical problem of low text matching accuracy in the text matching method in the prior art.
In one aspect, an embodiment of the present invention provides a text matching method, including: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; calculating the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features from the bidirectional association information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
In another aspect, an embodiment of the present invention provides a text matching device, including: a word vector sequence acquisition module, configured to acquire a first word vector sequence of a first text and a second word vector sequence of a second text; a similarity matrix acquisition module, configured to calculate the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; a vector matrix construction module, configured to acquire a row vector sequence and a column vector sequence in the similarity matrix and to construct a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; and a matching degree identifier generation module, configured to extract text matching features from the bidirectional association information coding vector matrix and to generate a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
In yet another aspect, an embodiment of the present invention provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the following steps: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; calculating the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features from the bidirectional association information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
In yet another aspect, an embodiment of the present invention provides a computer device including a memory storing a computer program and a processor that implements the following steps when executing the computer program: acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text; calculating the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix; acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence; extracting text matching features from the bidirectional association information coding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
According to the text matching method, the device, the computer readable storage medium and the computer equipment, the server acquires the first word vector sequence of the first text and the second word vector sequence of the second text and calculates the word vector similarities to obtain a similarity matrix. It then constructs a bidirectional association information coding vector matrix from the row vector sequence and the column vector sequence of the similarity matrix and their bidirectional association information, extracts text matching features from that matrix, and generates, according to the text matching features, a text matching degree identifier that marks the matching degree between the first text and the second text. The method thus uses not only the surface-level similarity features between the two texts but also mines those features further to obtain the deeper association information between the two texts, which effectively improves the accuracy of text matching.
Drawings
FIG. 1 is an application environment diagram of a text matching method in one embodiment;
FIG. 2 is a block diagram of a computer device in one embodiment;
FIG. 3 is a flow diagram of a text matching method in one embodiment;
FIG. 4 is a flowchart illustrating a word vector sequence acquisition step in one embodiment;
FIG. 5 is a flowchart illustrating a similarity matrix obtaining step according to an embodiment;
FIG. 6 is a flowchart illustrating a vector matrix acquisition step in one embodiment;
FIG. 7 is a flow chart of a two-way association information acquisition step in one embodiment;
FIG. 8 is a flowchart illustrating a vector matrix obtaining step according to another embodiment;
FIG. 9 is a flow diagram of a text match identification generation step in one embodiment;
FIG. 10 is a flow chart of a text matching method in one embodiment;
FIG. 11 is a block diagram of a text matching device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that the terms "first" and "second" in the embodiments of the present invention merely distinguish similar objects and do not denote a specific order of the objects. Where permitted, objects labeled "first" and "second" may be interchanged, so that the embodiments of the invention described herein can be implemented in sequences other than those illustrated or described.
FIG. 1 is a diagram of an application environment for a text matching method in one embodiment. Referring to fig. 1, the text matching method is applicable to a media content recommendation system. The media content recommendation system includes a terminal 110 and a server 120 connected through a network. Specifically, the server 120 may be implemented by a stand-alone server or a server cluster formed by a plurality of servers, the terminal 110 may be specifically a desktop terminal or a mobile terminal, and the mobile terminal may be specifically at least one of a mobile phone, a tablet computer, a notebook computer, etc., where the network includes but is not limited to: a wide area network, a metropolitan area network, or a local area network.
The media content recommendation system can be a media content similarity query recommendation tool based on mass data mining, can help users to quickly screen out information of interest in an information overload environment, and provides personalized decision support and information service for the users. Meanwhile, the media content recommendation system may refer to a system that makes media content recommendation for a user, for example: an article recommendation system refers to a system for recommending articles for users, and the article recommendation system can be realized by an article reading platform such as an application program (e.g. news). However, before the media content recommendation system is used to implement recommendation of media content in practical application, similarity mining needs to be performed on massive data, that is, the association relationship between two pieces of individual information is pre-mined, for example, similarity mining (may also be referred to as text matching) between text contents is established, so for convenience of description, the text matching method will be described below by taking the media content recommendation system as an example, and it should be understood that the embodiment of the invention is not limited to the media content recommendation system, and may also be applied to other systems, such as a video recommendation system, a hot spot recall system, and the like.
FIG. 2 illustrates an internal block diagram of a computer device in one embodiment. The computer device may be specifically the server 120 of fig. 1. As shown in fig. 2, the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing text analysis data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text matching method.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 2 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
As shown in fig. 3, in one embodiment, a text matching method is provided. The present embodiment is mainly exemplified by the application of the method to the server 120 in fig. 1. Referring to fig. 3, the text matching method specifically includes the steps of:
S302, acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text.
The first text and the second text may be the text contents whose similarity matching relationship is currently to be mined, and may be texts processed in advance into a given specification, such as long texts or short texts.
The first word vector sequence and the second word vector sequence may be word vector sequences generated by segmenting and vectorizing the first text and the second text. For example, the first text is segmented to obtain at least two text words, and the at least two text words are converted into vectors to obtain at least two text word vectors, i.e., the first word vector sequence.
Specifically, before performing text matching, the server 120 may first receive the first text and the second text sent by the terminal 110, then segment and vectorize them into a first word vector sequence corresponding to the first text and a second word vector sequence corresponding to the second text, and complete the subsequent vector matching operations according to the two word vector sequences.
S304, calculating the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix.
The word vector similarity may refer to the degree to which word vector features, such as word sense, part of speech and word frequency, are similar between the first word vector sequence and the second word vector sequence. The similarity may be expressed as a value in a numerical range, such as 0-1 or 0-10, or as a percentage, such as 0-100%.
Specifically, because the first text is matched against the second text, the word vectors in the first word vector sequence must be compared one by one with the word vectors in the second word vector sequence. That is, the server 120 acquires the first text and the second text, segments and vectorizes them to obtain more than one first word vector and more than one second word vector, and then calculates the cosine similarity between each first word vector and each second word vector; the resulting similarities are used to construct the similarity matrix.
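The pairwise cosine similarity calculation described for step S304 can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function names `cosine` and `similarity_matrix`, the toy two-dimensional vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def similarity_matrix(first_seq, second_seq):
    """Entry [i][j] compares word i of the first text with word j of the second text."""
    return [[cosine(u, v) for v in second_seq] for u in first_seq]

# Toy word vectors, made up for illustration only.
first_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
second_seq = [[1.0, 0.0], [0.0, 2.0]]
sim = similarity_matrix(first_seq, second_seq)  # 3 rows (first text) x 2 columns (second text)
```

With m words in the first text and n words in the second, the result is an m x n matrix, so the row direction follows the first text and the column direction follows the second.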
S306, acquiring a row vector sequence and a column vector sequence in the similarity matrix, and constructing a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence.
The row vector sequence may be a sequence of vectors obtained by dividing the similarity matrix along the row direction.
The column vector sequence may be a sequence of vectors obtained by dividing the similarity matrix along the column direction.
The bidirectional association information may be information obtained by mining the associations in the row vector sequence and the column vector sequence, where "bidirectional" may refer to the two directions of association query: from the first text to the second text, and from the second text to the first text.
Specifically, after obtaining the similarity matrix, the server 120 may first determine the row direction and the column direction of the similarity matrix and divide the matrix along each direction to obtain a row vector sequence and a column vector sequence. It then inputs the row vector sequence and the column vector sequence into a bidirectional association information mining network, such as a Bi-LSTM (Bi-directional Long Short-Term Memory) network, which captures the bidirectional semantic dependency between the two texts, and constructs a bidirectional association information encoding vector matrix from the mined bidirectional association information to facilitate subsequent matrix feature extraction and matching.
S308, extracting text matching features from the bidirectional association information encoding vector matrix, and generating a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
The text matching features may be text features to be extracted that are determined in advance through learning.
The text matching degree identifier may be an identifier that records the matching relationship between the first text and the second text, for example, a text matching identifier or a text non-matching identifier.
Specifically, after constructing the bidirectional association information encoding vector matrix based on the bidirectional association information of the row vector sequence and the column vector sequence, the server 120 may extract the text matching features from that matrix. The features may be extracted by a machine learning model or a deep learning model; for example, the bidirectional association information encoding vector matrix is input into a convolutional neural network model for text matching feature extraction. The text matching features are then used to generate the text matching degree identifier, for example by calculating a similarity from the text matching features and using the calculation result as the basis for generating the identifier.
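As a rough sketch of step S308, the example below slides small convolution kernels over an encoding matrix, pools each response map into one feature, and turns the features into a match / no-match identifier. The source only states that a convolutional neural network may extract the features and that a calculation result serves as the basis for the identifier; the ReLU, the global max pooling, the linear scoring layer with a sigmoid, the 0.5 threshold, and all names and numbers here (`conv2d_valid`, `match_identifier`, the kernels and weights) are assumptions for illustration, and in practice the parameters would be learned rather than hand-set.

```python
import math

def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [
        [
            sum(matrix[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def match_identifier(encoding, kernels, weights, bias, threshold=0.5):
    """Return (identifier, score) from an encoding matrix."""
    # One feature per kernel: ReLU followed by global max pooling.
    features = [
        max(max(0.0, x) for row in conv2d_valid(encoding, k) for x in row)
        for k in kernels
    ]
    # Assumed scoring layer: weighted sum of features squashed into (0, 1).
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    score = 1.0 / (1.0 + math.exp(-logit))
    return ("match" if score >= threshold else "no_match"), score

# Hypothetical 3x3 encoding matrix with a strong diagonal pattern.
encoding = [[0.9, 0.1, 0.0],
            [0.1, 0.8, 0.1],
            [0.0, 0.1, 0.9]]
kernels = [[[1.0, 0.0], [0.0, 1.0]],   # responds to diagonal alignment
           [[0.0, 1.0], [1.0, 0.0]]]   # responds to anti-diagonal alignment
label, score = match_identifier(encoding, kernels, weights=[2.0, -1.0], bias=-1.0)
```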
In this embodiment, the server acquires the first word vector sequence of the first text and the second word vector sequence of the second text and calculates the word vector similarities to obtain a similarity matrix. It then constructs a bidirectional association information encoding vector matrix from the row vector sequence and the column vector sequence of the similarity matrix and their bidirectional association information, extracts text matching features from that matrix, and generates, according to the text matching features, a text matching degree identifier that marks the matching degree between the first text and the second text. The method thus uses not only the surface-level similarity features between the two texts but also mines those features further to obtain the deeper association information between the two texts, which effectively improves the accuracy of text matching.
As shown in FIG. 4, in one embodiment, step S302, acquiring a first word vector sequence of a first text and acquiring a second word vector sequence of a second text, specifically includes the following steps:
S3022, acquiring a first text and a second text.
Specifically, before performing the text matching task, the server 120 first needs to acquire the first text and the second text. The two texts may be determined by the user and sent through the terminal 110; that is, the first text and the second text submitted by the user are sent to the server 120 over the network connection between the terminal 110 and the server 120.
S3024, word segmentation is carried out on the first text and the second text respectively, and a first word sequence of the first text and a second word sequence of the second text are obtained.
The first word sequence and the second word sequence can be word sequences obtained by word segmentation of the first text and the second text respectively.
Specifically, after obtaining the first text and the second text, the server 120 may segment them with a preset word segmentation algorithm to obtain a first word sequence of the first text and a second word sequence of the second text. It should be understood that the preset word segmentation algorithm may be a word segmentation algorithm based on character string matching, a word segmentation algorithm based on understanding, or a word segmentation algorithm based on statistics.
More specifically, the word segmentation algorithm based on character string matching may be a mechanical word segmentation algorithm: the character string to be segmented is matched against the entries of a pre-established dictionary, and if a match succeeds, the matched word is split off. Such algorithms can be divided into forward matching and reverse matching according to the scanning direction, and into maximum matching and minimum matching according to which length is given matching priority. The word segmentation algorithm based on understanding may perform syntactic and semantic analysis at the same time as segmentation and use the syntactic and semantic information to handle the segmentation. The word segmentation algorithm based on statistics may measure how likely adjacent characters are to form a word by how frequently they occur together; when the frequency is higher than a preset threshold, the adjacent characters may be determined to form a word.
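Of the three families of algorithms mentioned, dictionary-based string matching is the simplest to illustrate. The sketch below shows forward maximum matching, one common variant: scan left to right and, at each position, take the longest dictionary entry that matches. The function name `forward_max_match`, the `max_word_len` parameter, the single-character fallback for unknown text, and the toy English dictionary are assumptions for this example; the source does not specify which variant or dictionary is used.

```python
def forward_max_match(text, dictionary, max_word_len=8):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumed fallback: a single unknown character becomes its own word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

dictionary = {"text", "match", "matching", "method"}
print(forward_max_match("textmatchingmethod", dictionary))
# ['text', 'matching', 'method']
```

Note that "matching" wins over the shorter entry "match" because longer candidates are tried first; a reverse-matching variant would scan from the end of the string instead.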
S3026, according to the pre-stored data vector mapping relation, determining a mapping vector of the first word sequence as the first word vector sequence, and determining a mapping vector of the second word sequence as the second word vector sequence.
The pre-stored data vector mapping relationship may be a data mapping mechanism, that is, mapping characters into their corresponding real number vectors.
Specifically, using the pre-stored data vector mapping relationship, the server 120 may determine the real-valued vectors mapped from the first word sequence as the first word vector sequence, and the real-valued vectors mapped from the second word sequence as the second word vector sequence. The mapping may be one-to-one, so that the first word vector sequence and the second word vector sequence are uniquely determined.
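A pre-stored mapping of this kind can be pictured as a lookup table from words to real-valued vectors. The sketch below is illustrative only: the table contents, the three-dimensional vectors, the name `to_vector_sequence`, and the shared fallback vector for words missing from the table are all assumptions, since the source does not say how the mapping is stored or how unseen words are handled.

```python
def to_vector_sequence(words, vector_table, unknown_vector):
    """Look up each word's vector; words not in the table share one fallback vector."""
    return [list(vector_table.get(word, unknown_vector)) for word in words]

# Hypothetical pre-stored mapping from words to 3-dimensional real vectors.
vector_table = {
    "text": [0.2, 0.7, 0.1],
    "matching": [0.6, 0.1, 0.3],
}
unknown_vector = [0.0, 0.0, 0.0]

first_word_vectors = to_vector_sequence(["text", "matching"], vector_table, unknown_vector)
second_word_vectors = to_vector_sequence(["text", "search"], vector_table, unknown_vector)
```

Because the same table is used for both texts, identical words always map to identical vectors, which is what makes the later similarity comparison meaningful.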
In this embodiment, the word segmentation quantization is performed on the first text and the second text, and the similarity is calculated by using the quantized word vector sequence, so that the accuracy of text matching can be effectively improved.
As shown in FIG. 5, in one embodiment, step S304, calculating the word vector similarities between the first word vector sequence and the second word vector sequence to obtain a similarity matrix, specifically includes the following steps:
S3042, determining at least two first word vectors in the first word vector sequence, and determining at least two second word vectors in the second word vector sequence.
The first word vectors and the second word vectors may be the individual word vectors that make up the first word vector sequence and the second word vector sequence, respectively.
Specifically, before calculating the word vector similarities, the server 120 first needs to determine which word vectors in the first word vector sequence and the second word vector sequence are currently to be matched, and then calculates the similarity between each first word vector and each second word vector one by one, so that a similarity matrix can be constructed from the obtained similarity results.
S3044, multiplying the at least two first word vectors by the at least two second word vectors to obtain at least two word vector similarities.
Specifically, the server 120 may calculate the similarity between the first word vector and the second word vector by multiplying the first word vector and the second word vector one by one.
S3046, constructing a matrix according to the similarity of at least two word vectors, and obtaining a similarity matrix.
Specifically, the server 120 may construct a similarity matrix by using the word vector similarity obtained by multiplying the word vectors, and the ordering manner of the matrix corresponds to the multiplication calculation sequence of the first word vector and the second word vector.
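Steps S3042 to S3046 can be sketched as pairwise vector products arranged in calculation order. The example reads "multiplying" as the dot (inner) product, which is an interpretation rather than something the source states outright; the helper names and toy vectors are likewise assumptions. It also shows that when the vectors are first scaled to unit length, the products coincide with cosine similarity.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def product_matrix(first_seq, second_seq):
    # Row i follows the word order of the first text, column j that of the second.
    return [[dot(u, v) for v in second_seq] for u in first_seq]

# Toy vectors, made up for illustration.
first_seq = [[3.0, 4.0], [1.0, 0.0]]
second_seq = [[0.0, 2.0], [6.0, 8.0], [1.0, 1.0]]

raw = product_matrix(first_seq, second_seq)
unit = product_matrix([l2_normalize(u) for u in first_seq],
                      [l2_normalize(v) for v in second_seq])
```

Here `raw` holds unbounded products, while `unit` holds values in [-1, 1]; which of the two the method intends is not spelled out in this step.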
In this embodiment, the similarity matrix is constructed by respectively calculating the similarity between the vectors, so that the accuracy of text matching can be effectively improved.
As shown in FIG. 6, in one embodiment, step S306, acquiring a row vector sequence and a column vector sequence in the similarity matrix and constructing a bidirectional association information encoding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence, specifically includes the following steps:
S3062, dividing the similarity matrix into row vectors and column vectors to obtain a row vector sequence and a column vector sequence.
Specifically, the server 120 may divide the similarity matrix by extracting vectors along the row direction and along the column direction, obtaining a row vector sequence from the row-direction extraction and a column vector sequence from the column-direction extraction.
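The row and column division in step S3062 amounts to reading the same matrix in two directions. A minimal sketch, assuming the similarity matrix is stored as a list of rows (the function name and the sample values are hypothetical):

```python
def split_rows_and_columns(matrix):
    """Return the matrix as a sequence of row vectors and a sequence of column vectors."""
    rows = [list(row) for row in matrix]
    cols = [list(col) for col in zip(*matrix)]  # zip(*m) transposes a list of rows
    return rows, cols

# Hypothetical 2x3 similarity matrix: 2 words in the first text, 3 in the second.
sim = [[0.9, 0.1, 0.4],
       [0.2, 0.8, 0.3]]
row_seq, col_seq = split_rows_and_columns(sim)
```

Each row vector describes how one word of the first text relates to every word of the second text, and each column vector describes the reverse, which matches the two query directions attributed to the bidirectional association information.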
S3064, obtaining the bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short-term memory network.
The bidirectional long-short-term memory network is a type of recurrent neural network (RNN). It is suited to modeling sequential data such as text and can be used to model context information in natural language processing tasks.
Specifically, the server 120 may obtain the bidirectional association information of the row vector sequence and the column vector sequence through the bidirectional long-short-term memory network. That is, the row vector sequence and the column vector sequence are both input into the bidirectional long-short-term memory network, which mines the bidirectional association information and yields the bidirectional association information corresponding to the row vector sequence and to the column vector sequence respectively.
S3066, coding the bidirectional association information, and obtaining a bidirectional association information coding vector matrix according to the coded bidirectional association information.
Specifically, the encoding of the bidirectional association information may be an encoding operation that the server 120 directs the bidirectional long-short-term memory network to perform, yielding a bidirectional association information code. This code may be used to construct the bidirectional association information coding vector matrix from which text matching features are extracted.
In this embodiment, not only the similarity feature of the surface layer between two texts is used, but also the similarity feature is deeply mined through the two-way long-short-term memory network, so that the deep association information of the two texts is obtained, and finally the accuracy of text matching is improved.
As shown in fig. 7, in one embodiment, in step S3064, bi-directional association information of a row vector sequence and a column vector sequence is obtained through a bi-directional long-short-term memory network, which specifically includes the following steps:
S30642, respectively inputting a row vector sequence and a column vector sequence into a two-way long-short-term memory network;
S30610, acquiring first bidirectional association information and second bidirectional association information output by the bidirectional long-short-term memory network as the bidirectional association information; the first bidirectional association information and the second bidirectional association information are obtained by the bidirectional long-short-term memory network mining bidirectional association information from the row vector sequence and the column vector sequence respectively.
Specifically, the server 120 may obtain the first bidirectional association information and the second bidirectional association information output by the bidirectional long-short-term memory network by inputting the row vector sequence and the column vector sequence to the bidirectional long-short-term memory network, respectively, so as to obtain the bidirectional association information of the row vector sequence and the column vector sequence.
In this embodiment, the bidirectional association information of the row vector sequence and the column vector sequence is obtained through the bidirectional long-short-term memory network, so that the accuracy of text matching can be effectively improved.
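To make the bidirectional pass concrete, the following sketch runs a small LSTM over a sequence in both directions and concatenates the two hidden states at each position. It is an illustration under several assumptions that the text does not specify: the weights are random and untrained, the hidden size is 3, and a separate set of weights is used for the row vector sequence and the column vector sequence because their vector lengths differ. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """One LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size: int, hidden_size: int, seed: int) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight row per gate unit: input, forget, candidate, output.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(width)]
                        for _ in range(4 * hidden_size)]
        self.bias = [0.0] * (4 * hidden_size)

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = x + h  # concatenate the input with the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.weights, self.bias)]
        n = self.hidden_size
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        g = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o = [sigmoid(p) for p in pre[3 * n:4 * n]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run(cell: LSTMCell, sequence: List[Vector]) -> List[Vector]:
    """Feed the sequence through the cell and collect every hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in sequence:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


def bilstm(sequence: List[Vector], hidden_size: int = 3, seed: int = 0) -> List[Vector]:
    """Concatenate forward and backward hidden states at each position."""
    input_size = len(sequence[0])
    forward = run(LSTMCell(input_size, hidden_size, seed), sequence)
    backward = run(LSTMCell(input_size, hidden_size, seed + 1), sequence[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


row_seq = [[1.0, 0.5], [0.0, 0.5], [1.0, 1.0]]
col_seq = [[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]]
first_info = bilstm(row_seq, seed=0)    # stands in for the first bidirectional association information
second_info = bilstm(col_seq, seed=10)  # stands in for the second bidirectional association information
print(len(first_info), len(first_info[0]))    # 3 6
print(len(second_info), len(second_info[0]))  # 2 6
```

Because the backward pass reads the sequence in reverse, the output at each position depends on both the preceding and the following vectors, which is what "bidirectional association" refers to here. A practical system would use a trained network from a deep learning framework rather than this hand-written cell.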
As shown in fig. 8, in one embodiment, the bi-directional association information includes a first bi-directional association information and a second bi-directional association information, and in step S3066, the bi-directional association information is encoded, and according to the encoded bi-directional association information, a bi-directional association information encoding vector matrix is obtained, which specifically includes the following steps:
S30662, respectively encoding the first bidirectional association information and the second bidirectional association information to obtain a first bidirectional association information code and a second bidirectional association information code;
S30664, combining the first bidirectional association information code with the row vector sequence to obtain a first information coding vector matrix, and combining the second bidirectional association information code with the column vector sequence to obtain a second information coding vector matrix;
S30666, determining the first information coding vector matrix and the second information coding vector matrix as the bidirectional association information coding vector matrix.
Specifically, the server 120 may control the bidirectional long-short-term memory network to encode the first bidirectional association information and the second bidirectional association information, thereby obtaining a first bidirectional association information code and a second bidirectional association information code, further combine the first bidirectional association information code with the row vector sequence to obtain a first information code vector matrix, combine the second bidirectional association information code with the column vector sequence to obtain a second information code vector matrix, and finally obtain a bidirectional association information code vector matrix corresponding to the row vector sequence or the column vector sequence.
In this embodiment, the text matching feature for calculating the text similarity is obtained by constructing the bidirectional association information encoding vector matrix, so that the accuracy of text matching can be effectively improved.
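The combination step can be sketched as follows. The text does not say how a code is "combined" with its vector sequence, so this sketch assumes position-wise concatenation; the function name and the stand-in code values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def combine(code: Matrix, vectors: Matrix) -> Matrix:
    """Concatenate each original vector with the code computed at the same position (assumed)."""
    if len(code) != len(vectors):
        raise ValueError("one code vector is needed per vector in the sequence")
    return [vec + enc for vec, enc in zip(vectors, code)]


row_seq = [[1.0, 0.5], [0.0, 0.5]]
col_seq = [[1.0, 0.0], [0.5, 0.5]]
# Stand-in values for the first and second bidirectional association information codes.
first_code = [[0.1, -0.2], [0.0, 0.4]]
second_code = [[0.3, 0.3], [-0.1, 0.2]]

first_matrix = combine(first_code, row_seq)
second_matrix = combine(second_code, col_seq)
# The pair is taken together as the bidirectional association information coding vector matrix.
bidirectional_matrices = (first_matrix, second_matrix)
print(first_matrix)   # [[1.0, 0.5, 0.1, -0.2], [0.0, 0.5, 0.0, 0.4]]
print(second_matrix)  # [[1.0, 0.0, 0.3, 0.3], [0.5, 0.5, -0.1, 0.2]]
```

Concatenation keeps the surface similarity values alongside the mined codes; other combinations, such as element-wise addition, would also be consistent with the wording.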
As shown in fig. 9, in one embodiment, the step S308 of extracting text matching features in the bi-directional association information encoding vector matrix and generating a text matching degree identifier according to the text matching features specifically includes the following steps:
S3082, inputting the bidirectional association information coding vector matrix into a convolutional neural network model;
S3084, obtaining characteristic information output by the convolutional neural network model as text matching features;
S3086, inputting the text matching features into a full connection layer of the convolutional neural network model;
S3088, obtaining an output result of the full connection layer to obtain a text matching degree identifier; the output result is the result of matching by the full connection layer according to the text matching features; the text matching degree identifier comprises a text matching identifier and a text non-matching identifier.
Among them, convolutional neural networks (Convolutional Neural Networks, CNN) are a type of feedforward neural network that includes convolutional calculation and has a deep structure, and are one of representative algorithms of deep learning.
Specifically, the server 120 may first input the bidirectional association information coding vector matrix into the convolutional neural network model and obtain the text matching features that the model outputs. The server 120 then inputs the text matching features into the full connection layer of the convolutional neural network model, which outputs a feature matching result. From this result the server 120 generates the text matching degree identifier, which indicates whether the first text and the second text match. In this embodiment, the text matching degree identifier may be a text matching identifier represented by the value 1 or a text non-matching identifier represented by the value 0.
In the embodiment, the text matching characteristics are obtained through the convolutional neural network model, and the output result of text matching is obtained through the full-connection layer, so that the text matching degree identifier is obtained, and the accuracy of text matching can be effectively improved.
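Steps S3082 to S3088 can be illustrated with a small hand-written sketch. It assumes a single convolution layer with ReLU and global max pooling, a sigmoid full connection layer, and a 0.5 decision threshold; none of these choices, nor the hand-picked weights, come from the text, and a single input matrix is used for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image without padding and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def extract_features(image: Matrix, kernels: List[Matrix]) -> List[float]:
    """Apply ReLU and global max pooling to each feature map: one feature per kernel."""
    features = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        features.append(max(max(0.0, v) for row in feature_map for v in row))
    return features


def fully_connected(features: List[float], weights: List[float], bias: float) -> int:
    """Return 1 (text matching identifier) or 0 (text non-matching identifier)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else 0


# Toy input standing in for a bidirectional association information coding vector matrix.
encoded = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0],
           [1.0, 0.0, 1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]],   # responds to main-diagonal patterns
           [[0.0, 1.0], [1.0, 0.0]]]   # responds to anti-diagonal patterns
fc_weights, fc_bias = [1.0, 1.0], -3.0  # hand-picked, untrained values

features = extract_features(encoded, kernels)
label = fully_connected(features, fc_weights, fc_bias)
print(features)  # [2.0, 2.0]
print(label)     # 1
```

In a trained model the kernels and the weights of the full connection layer would be learned from labeled text pairs rather than set by hand.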
To help those skilled in the art thoroughly understand the embodiments of the present application, a specific example is described below in conjunction with fig. 10. Fig. 10 is a complete flow chart of the text matching method in this embodiment. As can be seen from fig. 10, the server 120 first obtains a first text (short text 1) and a second text (short text 2), and then performs word segmentation on the first text and the second text to obtain a first word sequence (word 1-N) and a second word sequence (word 1-k). The server 120 then uses a data mapping mechanism to obtain the first word vectors (word vector 1-N) and the second word vectors (word vector 1-k) corresponding to the first word sequence and the second word sequence respectively, and performs information interaction between the first word vectors and the second word vectors to obtain a similarity matrix N. The row vector sequence and the column vector sequence obtained by row-column segmentation of the similarity matrix N are input into the bidirectional long-short-term memory network to obtain the information codes that it outputs. These information codes are used to construct the bidirectional association information coding vector matrix, which is input into the convolutional neural network to obtain the characteristic information output by the convolutional neural network. Finally, the characteristic information is passed through the full connection layer to obtain a matching result, from which the text matching degree identifier of the first text and the second text is generated; the matching result may then be used by the text matching system, for example when matching texts for recommendation.
In the embodiment, the similarity characteristics of the surface layers between the two texts are used, and the similarity characteristics are further utilized to deeply mine and acquire the deep association information of the two texts, so that the accuracy of text matching is effectively improved.
It should be understood that, although the steps in the flowcharts of fig. 3-9 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 3-9 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
As shown in fig. 11, in one embodiment, a text matching apparatus 1100 is provided, where the apparatus 1100 may be disposed in a text matching system and configured to perform the above text matching method, and the text matching apparatus 1100 specifically includes: a word vector sequence acquisition module 1102, a similarity matrix acquisition module 1104, a vector matrix construction module 1106, and a matching degree identification generation module 1108, wherein:
a word vector sequence acquisition module 1102, configured to obtain a first word vector sequence of a first text, and obtain a second word vector sequence of a second text;
a similarity matrix acquisition module 1104, configured to calculate word vector similarity between the first word vector sequence and the second word vector sequence, respectively, to obtain a similarity matrix;
the vector matrix construction module 1106 is configured to obtain a row vector sequence and a column vector sequence in the similarity matrix, and construct a bidirectional association information coding vector matrix based on bidirectional association information of the row vector sequence and the column vector sequence;
the matching degree identification generation module 1108 is configured to extract text matching features in the bidirectional association information coding vector matrix, and generate a text matching degree identifier according to the text matching features; the text matching degree identifier is used to mark the matching degree between the first text and the second text.
In one embodiment, the word vector sequence obtaining module 1102 is further configured to obtain a first text and a second text; word segmentation is carried out on the first text and the second text respectively, and a first word sequence of the first text and a second word sequence of the second text are obtained; according to the pre-stored data vector mapping relation, determining a mapping vector of the first word sequence as a first word vector sequence, and determining a mapping vector of the second word sequence as a second word vector sequence.
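The acquisition step performed by module 1102 can be sketched as a segmentation followed by a table lookup. This is a minimal illustration: whitespace splitting stands in for a real word segmenter, the mapping table holds toy values, and mapping unknown words to a zero vector is an assumption not taken from the text. All names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical pre-stored word-to-vector mapping (toy 2-dimensional values).
EMBEDDINGS: Dict[str, Vector] = {
    "text": [0.2, 0.1],
    "matching": [0.7, 0.3],
    "method": [0.4, 0.9],
}
UNKNOWN: Vector = [0.0, 0.0]  # assumed fallback for words missing from the table


def segment(text: str) -> List[str]:
    """Split a text into a word sequence (whitespace split as a stand-in segmenter)."""
    return text.lower().split()


def to_vector_sequence(text: str, table: Dict[str, Vector]) -> List[Vector]:
    """Map every word of the text to its pre-stored vector."""
    return [list(table.get(word, UNKNOWN)) for word in segment(text)]


first_sequence = to_vector_sequence("Text matching method", EMBEDDINGS)
second_sequence = to_vector_sequence("matching device", EMBEDDINGS)
print(first_sequence)   # [[0.2, 0.1], [0.7, 0.3], [0.4, 0.9]]
print(second_sequence)  # [[0.7, 0.3], [0.0, 0.0]]
```

For Chinese text, a dedicated word segmentation tool would replace the whitespace split, and the table would typically come from pre-trained word embeddings.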
In one embodiment, the similarity matrix acquisition module 1104 is further configured to determine a first word vector of at least two of the first word vector sequences and determine a second word vector of at least two of the second word vector sequences; multiplying at least two first word vectors and at least two second word vectors respectively to obtain at least two word vector similarity; and constructing a matrix according to the similarity of at least two word vectors to obtain a similarity matrix.
In one embodiment, the vector matrix construction module 1106 is further configured to perform vector row-column segmentation on the similarity matrix to obtain a row vector sequence and a column vector sequence; acquire bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short-term memory network; and encode the bidirectional association information, and acquire a bidirectional association information coding vector matrix according to the encoded bidirectional association information.
In one embodiment, the vector matrix construction module 1106 is further configured to input the row vector sequence and the column vector sequence to the bidirectional long-short-term memory network, respectively; acquiring first bidirectional association information and second bidirectional association information output by a bidirectional long-short-term memory network as bidirectional association information; the first bidirectional association information and the second bidirectional association information are information which is obtained by respectively mining the bidirectional association information according to a row vector sequence and a column vector sequence by the bidirectional long-short-term memory network.
In one embodiment, the vector matrix construction module 1106 is further configured to encode the first bidirectional association information and the second bidirectional association information, respectively, to obtain a first bidirectional association information code and a second bidirectional association information code; combining the first bidirectional association information code with the row vector sequence to obtain a first information code vector matrix, and combining the second bidirectional association information code with the column vector sequence to obtain a second information code vector matrix; and determining the first information coding vector matrix and the second information coding vector matrix as two-way association information coding vector matrices.
In one embodiment, the matching degree identification generation module 1108 is further configured to input a bi-directional correlation information encoding vector matrix into the convolutional neural network model; acquiring characteristic information output by a convolutional neural network model, and taking the characteristic information as text matching characteristics; inputting the text matching characteristics to a full connection layer of the convolutional neural network model; obtaining an output result of the full connection layer to obtain a text matching degree identifier; the output result is the result of matching by the full connection layer according to the text matching characteristics; the text matching degree identification comprises a text matching identification and a text non-matching identification.
In this embodiment, the server may calculate the word vector similarity by acquiring a first word vector sequence of the first text and a second word vector sequence of the second text, thereby obtaining a similarity matrix, and further construct a bi-directional association information encoding vector matrix by using the row vector sequence and the column vector sequence in the similarity matrix and the corresponding bi-directional association information thereof, so as to extract text matching features from the bi-directional association information encoding vector matrix, and generate a text matching degree identifier for marking the matching degree between the first text and the second text according to the text matching features. By adopting the scheme, the similarity characteristics of the surface layers between the two texts are used, and the similarity characteristics are further utilized to deeply mine and acquire the deep association information of the two texts, so that the accuracy of text matching is effectively improved.
In one embodiment, the text matching device provided by the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 2. The memory of the computer device may store various program modules that make up the text matching apparatus, such as the word vector sequence acquisition module 1102, the similarity matrix acquisition module 1104, the vector matrix construction module 1106, and the matching degree identification generation module 1108 shown in fig. 11. The computer program of each program module causes a processor to execute the steps in the text matching method of each embodiment of the present application described in the present specification.
For example, the computer apparatus shown in fig. 2 may perform step S302 through the word vector sequence acquisition module 1102 in the text matching device shown in fig. 11. The computer device may perform step S304 through the similarity matrix acquisition module 1104. The computer device may perform step S306 by the vector matrix construction module 1106. The computer device may perform step S308 via the matching degree identification generation module 1108.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the text matching method described above. The steps of the text matching method herein may be the steps in the text matching method of the above-described respective embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the text matching method described above. The steps of the text matching method herein may be the steps in the text matching method of the above-described respective embodiments.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application. Although they are described in specific detail, they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (14)

1. A text matching method, comprising the steps of:
acquiring a first word vector sequence of a first text, and acquiring a second word vector sequence of a second text;
respectively calculating the word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix;
performing vector row-column segmentation on the similarity matrix to obtain a row vector sequence and a column vector sequence;
acquiring bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short-term memory network;
coding the bidirectional association information, and acquiring a bidirectional association information coding vector matrix according to the coded bidirectional association information;
extracting text matching characteristics in the bidirectional association information coding vector matrix, and generating a text matching degree identifier according to the text matching characteristics; the text matching degree identifier is used for marking the matching degree between the first text and the second text.
2. The method of claim 1, wherein the obtaining a first word vector sequence of the first text and obtaining a second word vector sequence of the second text comprises:
acquiring a first text and a second text;
word segmentation is carried out on the first text and the second text respectively, and a first word sequence of the first text and a second word sequence of the second text are obtained;
according to a pre-stored data vector mapping relation, determining a mapping vector of the first word sequence as the first word vector sequence, and determining a mapping vector of the second word sequence as the second word vector sequence.
3. The method of claim 1, wherein the calculating word vector similarity between the first word vector sequence and the second word vector sequence, respectively, to obtain a similarity matrix comprises:
determining first word vectors of at least two of the first word vector sequences, and determining second word vectors of at least two of the second word vector sequences;
multiplying the at least two first word vectors and the at least two second word vectors respectively to obtain at least two word vector similarity;
and constructing a matrix according to the similarity of the at least two word vectors to obtain the similarity matrix.
4. The method of claim 1, wherein the obtaining, via a bidirectional long-short-term memory network, bidirectional association information of the row vector sequence and the column vector sequence comprises:
respectively inputting the row vector sequence and the column vector sequence into the two-way long-short-term memory network;
acquiring first bidirectional association information and second bidirectional association information output by the bidirectional long-short-term memory network as the bidirectional association information; the first bidirectional association information and the second bidirectional association information are information obtained by respectively mining the bidirectional association information by the bidirectional long-short-term memory network according to the row vector sequence and the column vector sequence.
5. The method of claim 1, wherein the bi-directional association information comprises a first bi-directional association information and a second bi-directional association information, wherein the encoding the bi-directional association information and obtaining the bi-directional association information encoding vector matrix according to the encoded bi-directional association information comprises:
encoding the first bidirectional association information and the second bidirectional association information respectively to obtain a first bidirectional association information code and a second bidirectional association information code;
combining the first bidirectional association information code with the row vector sequence to obtain a first information code vector matrix, and combining the second bidirectional association information code with the column vector sequence to obtain a second information code vector matrix;
and determining the first information coding vector matrix and the second information coding vector matrix as the two-way association information coding vector matrix.
6. The method of claim 1, wherein extracting text matching features in the bi-directional associated information encoding vector matrix and generating text matching degree identifiers according to the text matching features comprises:
inputting the bidirectional association information coding vector matrix into a convolutional neural network model;
acquiring characteristic information output by the convolutional neural network model as the text matching characteristic;
inputting the text matching features to a full connection layer of the convolutional neural network model;
obtaining an output result of the full connection layer to obtain the text matching degree identification; the output result is a result of matching of the full-connection layer according to the text matching characteristics; the text matching degree identification comprises a text matching identification and a text non-matching identification.
7. A text matching device, the device comprising:
the word vector sequence acquisition module is used for acquiring a first word vector sequence of the first text and acquiring a second word vector sequence of the second text;
the similarity matrix acquisition module is used for respectively calculating the word vector similarity between the first word vector sequence and the second word vector sequence to obtain a similarity matrix;
the vector matrix construction module is used for carrying out vector row column segmentation on the similarity matrix to obtain a row vector sequence and a column vector sequence; acquiring bidirectional association information of the row vector sequence and the column vector sequence through a bidirectional long-short-term memory network; coding the bidirectional association information, and acquiring a bidirectional association information coding vector matrix according to the coded bidirectional association information;
the matching degree identification generation module is used for extracting text matching characteristics in the bidirectional association information coding vector matrix and generating text matching degree identification according to the text matching characteristics; the text matching degree identifier is used for marking the matching degree between the first text and the second text.
8. The apparatus of claim 7, wherein the word vector sequence acquisition module is further configured to acquire a first text and a second text; word segmentation is carried out on the first text and the second text respectively, and a first word sequence of the first text and a second word sequence of the second text are obtained; according to a pre-stored data vector mapping relation, determining a mapping vector of the first word sequence as the first word vector sequence, and determining a mapping vector of the second word sequence as the second word vector sequence.
9. The apparatus of claim 7, wherein the similarity matrix acquisition module is further configured to determine a first word vector of at least two of the first word vector sequences and to determine a second word vector of at least two of the second word vector sequences; multiplying the at least two first word vectors and the at least two second word vectors respectively to obtain at least two word vector similarity; and constructing a matrix according to the similarity of the at least two word vectors to obtain the similarity matrix.
10. The apparatus of claim 7, wherein the vector matrix construction module is further configured to input the row vector sequence and the column vector sequence to the two-way long-short-term memory network, respectively; acquiring first bidirectional association information and second bidirectional association information output by the bidirectional long-short-term memory network as the bidirectional association information; the first bidirectional association information and the second bidirectional association information are information obtained by respectively mining the bidirectional association information by the bidirectional long-short-term memory network according to the row vector sequence and the column vector sequence.
11. The apparatus of claim 7, wherein the bi-directional association information comprises a first bi-directional association information and a second bi-directional association information, and wherein the vector matrix construction module is further configured to encode the first bi-directional association information and the second bi-directional association information, respectively, to obtain a first bi-directional association information code and a second bi-directional association information code; combining the first bidirectional association information code with the row vector sequence to obtain a first information code vector matrix, and combining the second bidirectional association information code with the column vector sequence to obtain a second information code vector matrix; and determining the first information coding vector matrix and the second information coding vector matrix as the two-way association information coding vector matrix.
12. The apparatus of claim 7, wherein the vector matrix construction module is further configured to input the bi-directional correlation information encoding vector matrix to a convolutional neural network model; acquiring characteristic information output by the convolutional neural network model as the text matching characteristic; inputting the text matching features to a full connection layer of the convolutional neural network model; obtaining an output result of the full connection layer to obtain the text matching degree identification; the output result is a result of matching of the full-connection layer according to the text matching characteristics; the text matching degree identification comprises a text matching identification and a text non-matching identification.
13. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 6.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010067253.6A CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010067253.6A CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN112749539A (en) 2021-05-04
CN112749539B (en) 2023-09-15

Family

ID=75645130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010067253.6A Active CN112749539B (en) 2020-01-20 2020-01-20 Text matching method, text matching device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN112749539B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600580B (en) * 2022-11-29 2023-04-07 深圳智能思创科技有限公司 Text matching method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776545A (en) * 2016-11-29 2017-05-31 西安交通大学 Method for calculating similarity between short texts using a deep convolutional neural network
CN108846077A (en) * 2018-06-08 2018-11-20 泰康保险集团股份有限公司 Semantic matching method, device, medium and electronic equipment for question-and-answer text
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Deep-learning-based text similarity detection method for the financial industry
CN110348014A (en) * 2019-07-10 2019-10-18 电子科技大学 Semantic similarity calculation method based on deep learning
WO2019235103A1 (en) * 2018-06-07 2019-12-12 日本電信電話株式会社 Question generation device, question generation method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
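Text Matching as Image Recognition; Liang Pang et al.; 《https://arxiv.org/pdf/1602.06359.pdf》; 1-8 *
Chinese sentence similarity calculation based on convolutional neural networks; Sun Yang; 《China Master's Theses Full-text Database, Information Science and Technology Series》 (No. 08); I138-1277 *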

Also Published As

Publication number Publication date
CN112749539A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN110162669B (en) Video classification processing method and device, computer equipment and storage medium
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
CN109783655B (en) Cross-modal retrieval method and device, computer equipment and storage medium
CN108563782B (en) Commodity information format processing method and device, computer equipment and storage medium
WO2020258506A1 (en) Text information matching degree detection method and apparatus, computer device and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN110737818B (en) Network release data processing method, device, computer equipment and storage medium
CN111859916B (en) Method, device, equipment and medium for extracting key words of ancient poems and generating poems
CN112231224A (en) Business system testing method, device, equipment and medium based on artificial intelligence
CN110750523A (en) Data annotation method, system, computer equipment and storage medium
CN110362798B (en) Method, apparatus, computer device and storage medium for judging information retrieval analysis
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN113627207B (en) Bar code identification method, device, computer equipment and storage medium
CN109460541B (en) Vocabulary relation labeling method and device, computer equipment and storage medium
CN111191028A (en) Sample labeling method and device, computer equipment and storage medium
CN113886550A (en) Question-answer matching method, device, equipment and storage medium based on attention mechanism
CN113449489A (en) Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium
CN111709229A (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN112749539B (en) Text matching method, text matching device, computer readable storage medium and computer equipment
CN113254687B (en) Image retrieval and image quantification model training method, device and storage medium
CN113342927B (en) Sensitive word recognition method, device, equipment and storage medium
CN114328898A (en) Text abstract generating method and device, equipment, medium and product thereof
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN112732884A (en) Target answer sentence generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40048284; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant