CN112749252A - Text matching method based on artificial intelligence and related device - Google Patents


Info

Publication number
CN112749252A
CN112749252A (application number CN202010675118.XA)
Authority
CN
China
Prior art keywords: text, similarity, regression model, determining, matching
Prior art date
Legal status
Granted
Application number
CN202010675118.XA
Other languages
Chinese (zh)
Other versions
CN112749252B (en)
Inventor
吴嫒博
刘萌
蔡晓凤
李超
卢鑫鑫
刘晓靖
肖世伟
孙朝旭
张艺博
滕达
付贵
周伟强
王静
崔立鹏
叶礼伟
曹云波
关俊辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010675118.XA
Publication of CN112749252A
Application granted
Publication of CN112749252B
Legal status: Active

Classifications

    • G06F 16/3344 — Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F 40/216 — Handling natural language data; natural language analysis; parsing using statistical methods
    • G06F 40/289 — Handling natural language data; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 20/00 — Machine learning
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses a text matching method based on artificial intelligence and a related device, involving at least natural language processing and machine learning technologies in artificial intelligence. The method comprises the following steps: acquiring a first text and a second text, wherein the first text is the subject of a target text and the second text is determined according to the body of the target text, and the length difference between the first text and the second text is greater than a threshold value; determining input data of a first regression model according to the first text and the second text, and determining the similarity of the first text and the second text through the first regression model; and determining the matching degree of the first text and the second text according to the similarity. In the embodiment of the application, the matching result of the matched texts is obtained through a regression model rather than a classification model. The output of the regression model is continuous data, which better reflects the similarity between the first text and the second text, so that the matching degree obtained from the similarity is more accurate.

Description

Text matching method based on artificial intelligence and related device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text matching method and a related apparatus based on artificial intelligence.
Background
Text matching judges the similarity of the contents of two texts: the larger the similarity value, the more semantically similar the two texts are. Text matching can be subdivided into three types according to text length: short text to short text, short text to long text, and long text to long text. Short text generally refers to text of relatively short length, typically no more than 160 characters, such as chat messages, opinion comments, short messages, and document summaries. Long text may be, for example, news articles, literature, or treatises.
Existing methods for matching short text with short text generally convert the two texts into word vectors, compute the cosine distance between the word vectors to obtain the similarity of the two texts, and then judge whether the similarity is greater than a preset threshold. If so, the output is 1, indicating that the two texts are similar; if not, the output is 0, indicating that they are not similar.
When this method is applied to matching a short text with a long text, the content and length of the two texts differ greatly, so a binary result of 0 or 1 cannot quantify the degree of similarity between the matched texts, which reduces the accuracy of the text matching result.
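As an illustrative sketch (not code from the patent), the existing short-text matching baseline described above — word vectors, cosine similarity, and a hard threshold — might look like the following; the vectorization step is assumed to happen elsewhere:

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Cosine similarity between two word vectors: 1.0 for identical
    # direction, 0.0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

def classify_match(vec_a, vec_b, threshold=0.5):
    # Existing approach: reduce the similarity to a hard 0/1 decision,
    # discarding any quantification of *how* similar the texts are.
    return 1 if cosine_similarity(vec_a, vec_b) > threshold else 0
```

The hard 0/1 output here is exactly what the patent identifies as the weakness when the two texts differ greatly in length.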
Disclosure of Invention
In order to solve the above technical problem, the application provides a text matching method based on artificial intelligence and a related device, which improve the accuracy of matching results when the lengths of the matched texts differ greatly.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a text matching method based on artificial intelligence, where the method includes:
acquiring a first text and a second text, wherein the first text is a subject of a target text, and the second text is determined according to a body of the target text; the length difference between the first text and the second text is greater than a threshold value;
determining input data of a first regression model according to the first text and the second text, and determining the similarity of the first text and the second text through the first regression model;
and determining the matching degree of the first text and the second text according to the similarity.
In another aspect, an embodiment of the present application provides a text matching apparatus based on artificial intelligence, where the apparatus includes: the device comprises an acquisition unit, a similarity unit and a matching degree unit;
the acquiring unit is used for acquiring a first text and a second text, wherein the first text is a subject of a target text, and the second text is determined according to a body of the target text; the length difference between the first text and the second text is greater than a threshold value;
the similarity unit is used for determining input data of a first regression model according to the first text and the second text, and determining the similarity between the first text and the second text through the first regression model;
and the matching degree unit is used for determining the matching degree of the first text and the second text according to the similarity.
In another aspect, an embodiment of the present application provides an apparatus for text matching, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
According to the technical scheme, after the first text and the second text are obtained, the matching result is obtained through a regression model rather than a classification model. Specifically, the similarity of the first text and the second text is determined through a first regression model, and the matching degree of the first text and the second text is obtained from the similarity. Although the content and length of the first text and the second text differ greatly, the matching result obtained with the regression model is not limited to the two values 0 and 1 but is, for example, a real number in the interval from 0 to 1; such continuous data reflects the matching degree between the first text and the second text more accurately than discrete data. Meanwhile, compared with a classification model, the loss function of the regression model is smoother and the distribution of the output results is more dispersed, that is, the similarities are spread out instead of accumulating in a fixed interval, which increases the discrimination among similarities, so that the matching degree obtained from the similarity is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of a text matching method according to an embodiment of the present application;
fig. 2 is a flowchart of a text matching method according to an embodiment of the present application;
fig. 3 is a flowchart of another text matching method provided in the embodiment of the present application;
fig. 4 is a flowchart of a method for determining a text matching degree according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a method for determining whether a composition is off-topic according to an embodiment of the present disclosure;
FIG. 6 is a diagram illustrating a composition correction result provided in an embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a text matching apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an apparatus for text matching according to an embodiment of the present disclosure;
fig. 9 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The existing method for matching short text with short text generally adopts a classification model whose output is either 0 or 1. Research has found that when this method is applied to matching a short text with a long text, the similarity results generally fall in a narrow fixed interval, for example between 0.8 and 0.95, because the lengths and contents of the short text and the long text differ greatly. Because the similarity results vary little and have low discrimination, the choice of the preset threshold strongly influences the text matching result: when the preset threshold is set low, most of the output matching results are 1; when it is set high, most are 0.
Therefore, with a classification model, the obtained similarity results vary little and have low discrimination, making it difficult to choose a clear preset threshold, which reduces the accuracy of the matching result. Moreover, there are only two possible matching results, similar or dissimilar, and neither can quantify the degree of similarity between the matched texts, which further reduces the accuracy of the text matching result.
Based on this, the embodiment of the present application provides a text matching method that obtains the matching degree between a first text and a second text through a first regression model, where the first text is a short text and the second text is a long text. Although the content and length of the short text and the long text differ greatly, the matching result obtained with the regression model is not limited to the two values 0 and 1 but is, for example, a real number in the interval from 0 to 1; such continuous data reflects the matching degree between the short text and the long text more accurately than discrete data. Meanwhile, compared with a classification model, the loss function of the first regression model is smoother and the similarity results are more dispersed instead of accumulating in a fixed interval, which increases their discrimination, so the matching degree obtained from the similarity is more accurate.
The text matching method provided by the embodiment of the application is realized based on Artificial Intelligence (AI). AI is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved is the natural language processing direction described above.
For example, text preprocessing and semantic understanding in Natural Language Processing (NLP) may be involved, including word and sentence segmentation, part-of-speech tagging, sentence classification, morphological analysis, syntactic analysis, semantic analysis, pragmatic analysis, semantic reasoning, sentiment analysis, and the like.
In order to facilitate understanding of the technical solution of the present application, the text matching method provided in the embodiment of the present application is introduced below with reference to an actual application scenario.
The text matching method provided by the embodiment of the application can be applied to data processing equipment with text matching capability, such as terminal equipment and a server. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may specifically be an independent server, or may also be a cluster server.
The data processing device can have Natural Language Processing (NLP) capability. NLP is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
In the embodiment of the application, the data processing device obtains various types of information such as the similarity and the matching degree of the first text and the second text through an NLP technology.
The data processing apparatus may be provided with Machine Learning (ML) capabilities. ML is a multi-field interdisciplinary, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.
Referring to fig. 1, the figure is a schematic view of an application scenario of the text matching method provided in the embodiment of the present application. In this application scenario, the processing device is the server 101. The scenario can be applied to composition correction: an important index in correcting a composition is whether it is off-topic, and an off-topic composition receives a lower score. Whether a composition is off-topic can be judged by whether its text content matches its title. The higher the matching degree between the text content and the title, the lower the possibility that the composition is off-topic and the higher the possibility of a high score; the lower the matching degree, the higher the possibility that the composition is off-topic and the higher the possibility of a low score.
The server 101 may obtain the text content and the title of a composition by acquiring the composition. The user whose composition is to be corrected can upload the composition text directly to the server 101 through a terminal device, or photograph the composition and upload the picture to the server 101 through the terminal device. The terminal device may be an intelligent terminal, a computer, a Personal Digital Assistant (PDA for short), a tablet computer, or the like.
As shown in fig. 1, the target text is a text for which a matching degree is to be obtained, and the target text is related to the first text and the second text. The first text represents the subject of the target text, such as the title of the composition, the writing requirement of the composition, and the like, and the second text represents the body content of the target text, such as the body content of the composition, the content of a certain segment of the body, and the like.
After obtaining the first text and the second text, the server 101 determines the input data of the first regression model according to the first text and the second text; the input data may be, for example, the text data itself or feature vectors of the texts. After the input data is input into the first regression model, the similarity between the first text and the second text is obtained, and the matching degree of the first text and the second text is obtained according to the similarity.
The first regression model is trained with continuous label data, and the output of the model is a continuous numerical value that can represent the similarity between the first text and the second text. The similarity represents how similar the first text and the second text are in content, and the resulting matching degree reflects this similarity better than a binary 0-or-1 matching result. For example, if the matching degree is a real number between 0 and 1, the closer the matching result is to 1, the more similar the first text and the second text are and the lower the possibility that the composition is off-topic; similarly, the closer the matching result is to 0, the more dissimilar the first text and the second text are and the higher the possibility that the composition is off-topic.
Meanwhile, compared with a classification model, the loss function of the regression model is smoother, and the distribution of the output result is more dispersed, namely the distribution of the similarity result is more dispersed instead of being accumulated in a fixed interval, so that the discrimination of the similarity result is increased, and the matching degree obtained according to the similarity is more accurate.
Next, a text matching method provided by an embodiment of the present application will be described with reference to the drawings.
Referring to fig. 2, fig. 2 shows a flow chart of a text matching method, the method comprising:
s201: acquiring a first text and a second text, wherein the first text is a subject of a target text, and the second text is determined according to a body of the target text; the first text and the second text differ in length by more than a threshold.
The embodiment of the application is applied to a scenario in which a short text is matched with a long text, so the difference between the length of the first text and that of the second text is large. For example, the first text is the title of a composition, generally not more than ten characters, while the second text is the full text of the composition, generally about eight hundred characters; the lengths of the first text and the second text thus differ greatly. The length difference between the first text and the second text is greater than the threshold; the size of the threshold is not specifically limited in the embodiment of the present application and can be set by a person skilled in the art according to actual needs.
The first text is the subject of the target text, and the second text is determined according to the body of the target text. For example, the target text is a composition, the first text is the title of the composition, and the second text is the body content of the composition; both the first text and the second text are related to the target text, and whether the target text is off-topic is judged by acquiring the matching degree of the first text and the second text.
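The length condition in S201 can be sketched as follows; the threshold value of 100 characters is purely illustrative, since the patent leaves the threshold to be set according to actual needs:

```python
def length_difference_exceeds(first_text, second_text, threshold=100):
    # The method targets short-to-long matching: it applies when the
    # length difference between the two texts exceeds a threshold.
    return abs(len(first_text) - len(second_text)) > threshold
```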
S202: and determining input data of a first regression model according to the first text and the second text, and determining the similarity of the first text and the second text through the first regression model.
Because the length difference between the first text and the second text is large and the second text contains far more features than the first text, the similarity results obtained by the existing classification model accumulate in a narrow fixed interval. This leads to low discrimination, making it difficult to choose a clear preset threshold for the classification model's output, which reduces the accuracy of the matching result.
In view of this, in order to disperse the distribution of the similarity and increase the degree of distinction, the embodiment of the present application does not use the classification model, but uses the regression model, i.e., the first regression model. The first regression model may be, for example, a logistic regression model, a polynomial regression model, or the like.
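As a hedged sketch of the logistic-regression option mentioned above (the features and weights here are hypothetical placeholders, not values from the patent), a regression model produces a continuous similarity rather than a class label:

```python
import math

def predict_similarity(features, weights, bias=0.0):
    # Logistic-regression-style scorer: a linear combination of the
    # input features passed through a sigmoid, yielding a continuous
    # similarity in (0, 1) instead of a discrete 0/1 class.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```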
After the first text and the second text are obtained, the input data of the first regression model is determined according to the first text and the second text, for example, word vector features of the first text and the second text can be respectively extracted, and the word vector features are used as the input data.
After the input data of the first regression model is determined, the input data is input into the first regression model, and the degree of similarity of the first text and the second text is predicted through the first regression model. The output of the first regression model is a continuous numerical value representing the similarity between the first text and the second text, which reflects the similarity of the two texts better than a binary 0-or-1 matching result.
The first regression model is a pre-trained regression model whose parameters are adjusted with continuous label values to obtain a better fit: the data type of the similarity labels is continuous, and so is the data type of the output. Compared with a classification model, the output results of the regression model are more dispersed, that is, the similarity results are spread out instead of accumulating in a fixed interval, which increases their discrimination.
S203: and determining the matching degree of the first text and the second text according to the similarity.
After the more accurate similarity is obtained, the matching degree of the first text and the second text obtained through the similarity is more accurate. For example, the similarity may directly represent the matching degree of the first text and the second text, and the higher the similarity is, the higher the matching degree of the first text and the second text is; the lower the similarity, the lower the degree of matching of the first text and the second text.
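In the simplest embodiment described above, the similarity itself serves as the matching degree; a minimal sketch (the clamping to [0, 1] is a defensive assumption of this sketch, not a requirement stated in the patent):

```python
def matching_degree(similarity):
    # S203: the similarity directly represents the matching degree;
    # clamping to [0, 1] guards against numerical drift.
    return min(max(similarity, 0.0), 1.0)
```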
According to the technical scheme, after the first text and the second text are obtained, the matching result is obtained through a regression model rather than a classification model. Specifically, the similarity of the first text and the second text is determined through the first regression model, and the matching degree of the first text and the second text is obtained from the similarity. Although the content and length of the first text and the second text differ greatly, the matching result obtained with the regression model is not limited to the two values 0 and 1 but is, for example, a real number in the interval from 0 to 1; such continuous data reflects the matching degree between the first text and the second text more accurately than discrete data. Meanwhile, compared with a classification model, the loss function of the regression model is smoother and the distribution of the output results is more dispersed, that is, the similarities are spread out instead of accumulating in a fixed interval, which increases the discrimination among similarities, so that the matching degree obtained from the similarity is more accurate.
The Bidirectional Encoder Representations from Transformers (BERT) model pre-trains deep bidirectional representations by jointly conditioning on context in all layers; when applied to matching short text with short text, it achieves high accuracy.
However, research shows that when the BERT model is applied to matching a short text with a long text, the obtained similarity results accumulate in the interval 0.80 to 0.95 and vary little, so it is difficult to distinguish whether the short text is similar to the long text through a clear threshold. This is because the BERT classification model outputs only two results, 0 indicating dissimilarity or 1 indicating similarity; that is, the output data type is discrete. Discrete data cannot accurately quantify the degree of similarity; for example, it cannot conclude that a short text is 80% similar to a long text.
In order to disperse similarity distribution and increase the discrimination, the embodiment of the application constructs a regression model by modifying a BERT network so as to improve the accuracy of the matching degree of the short text and the long text. The regression model determined based on the BERT model is specifically described below.
First, because the output of the regression model is continuous data, the labels need to be re-annotated: each label is changed from the original discrete 0 or 1 to a real value between 0 and 1 that represents the similarity between the first text and the second text. Refining the label values in this way yields a more accurate similarity result.
Second, the loss function is changed from the original cross-entropy function to the mean squared error (MSE) function. The MSE function averages the squared differences between predicted and true values at corresponding points. It improves the precision of the loss between predicted and true data, and the squaring increases the discrimination among similarity results, preventing them from clustering in a narrow fixed interval.
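As a minimal illustration of the MSE loss described here (plain Python, not the patent's actual training code):

```python
def mse_loss(predictions, targets):
    """Mean squared error: the average of the squared differences
    between predicted similarities and the re-annotated real-valued
    labels in [0, 1]."""
    assert len(predictions) == len(targets)
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# Example: three predicted similarities against their true labels.
loss = mse_loss([0.9, 0.2, 0.6], [1.0, 0.0, 0.5])
```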
Finally, the activation function of the network is changed from the normalized exponential (Softmax) function to the logistic (Sigmoid) function. The BERT network uses Softmax because it pairs conveniently with cross-entropy; since this embodiment replaces cross-entropy with MSE, Softmax is no longer needed. The Sigmoid function is an S-shaped curve that maps its input monotonically and continuously into the range 0 to 1. Its bounded output makes optimization stable, and using it as the output layer maps the similarity result into the interval 0 to 1, again preventing the results from clustering in a narrow fixed interval.
The regression model derived from the BERT model has been described above at the level of principle; it is described below at the level of code.
In the Python (.py) code, three modifications are required. First, the label value is changed from label_ids (int32) to vals (float32), i.e., from a 32-bit integer to a 32-bit floating-point number. Then, the log_softmax operation applied to the fully connected output logits in the last layer of the network is replaced with a sigmoid operation. Finally, the loss function in the code is changed from the cross-entropy function to the MSE function.
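The three modifications can be sketched together in plain Python (illustrative only; the actual changes are made in the TensorFlow graph of the BERT code, and the helper below is an assumption about how the pieces fit together, not the patent's source):

```python
import math

def sigmoid(x):
    # Replaces the log_softmax of the original classification head.
    return 1.0 / (1.0 + math.exp(-x))

def regression_head(logit, val):
    """Schematic of the modified output layer: `val` is a float32-style
    label in [0, 1] (vals) instead of an integer class id (label_ids),
    the activation is sigmoid instead of log_softmax, and the loss is
    MSE instead of cross-entropy."""
    prob = sigmoid(logit)        # predicted similarity in (0, 1)
    loss = (prob - val) ** 2     # per-example squared error
    return prob, loss
```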
After the log_softmax operation is replaced by the sigmoid operation, the network outputs a 1 x 2 vector whose values range between 0 and 1. To output the similarity of the first text and the second text as a single value, the second dimension of the vector is removed with the squeeze function, so that the model outputs a real number between 0 and 1.
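The effect of squeeze can be illustrated with NumPy (np.squeeze mirrors the role tf.squeeze plays on the model output; the (1, 1) shape here is an assumption for demonstration, and the shapes in the actual graph may differ):

```python
import numpy as np

# np.squeeze removes all size-1 dimensions, turning a matrix-shaped
# model output into a plain scalar similarity.
output = np.array([[0.87]])          # model output after sigmoid
similarity = np.squeeze(output)      # a single real number in (0, 1)
```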
A text matching method using the regression model derived from the BERT model is described below with reference to fig. 3.
Referring to fig. 3, fig. 3 shows a flow chart of yet another text matching method, the method comprising:
S301: acquiring a first text and a second text, wherein the first text is the subject of a target text and the second text is determined according to the body of the target text; the lengths of the first text and the second text differ by more than a threshold.
S302: inputting the first text and the second text into the first regression model, extracting a feature vector through the first regression model, and determining the similarity of the first text and the second text.
The first text and the second text can be input directly into the regression model derived from the BERT model, which automatically extracts the features of both texts according to context information and determines their similarity.
Because the regression model extracts the feature vectors of the first and second texts itself, no feature vectors need to be engineered by hand. Model-extracted features reduce subjective human factors, which improves accuracy and lowers labor cost.
S303: and determining the matching degree of the first text and the second text according to the similarity.
In this solution, after the first text and the second text are obtained, they are input into the regression model derived from BERT, which automatically extracts feature vectors, outputs a similarity, and from that similarity the matching degree of the two texts is obtained. Even though the two texts differ greatly in content and length, the regression model weakens the effect of the length difference. Its labels are real numbers between 0 and 1, finer-grained than a bare 0 or 1, so the resulting loss curve is smoother and fits the similarity better. The outputs are more widely distributed, i.e., the similarities spread out instead of piling up in a fixed interval, which increases discrimination and makes the matching degree derived from the similarity more accurate.
The matching degree depends not only on the similarity between the first text and the second text but also on other features of the text, such as the number of times the keywords of the first text appear in the second text: the more occurrences, the higher the matching degree between the first text and the second text.
Determining the matching degree of the first text and the second text from multi-dimensional features is described below with reference to fig. 4.
Referring to fig. 4, fig. 4 shows a flow chart of a method of determining a text match, the method comprising:
S401: determining the association attribute of the target text according to the text content contained in the subject and the body, wherein the association attribute reflects the degree of association between the subject and the body in the text dimension.
In order to determine the matching degree of the first text and the second text through the multi-dimensional features, the association attribute of the target text can be obtained, and the association attribute represents the association degree of the subject and the body of the target text in the text dimension.
In order to facilitate understanding for those skilled in the art, two associated attributes are exemplified below.
Association attribute one: the longest common subsequence ratio of the subject to the text content.
The more of the subject's words that appear in the text content as a common subsequence, the better the subject is considered to match the text content, i.e., the better the first text matches the second text. This feature may be expressed as follows:
(The original presents this feature as a formula image: the longest common subsequence ratio of the subject to the text content.)
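The longest common subsequence feature can be sketched in Python as follows; normalizing by the subject length is an assumption, since the original formula appears only as an image:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via the standard
    dynamic-programming table."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_ratio(subject: str, body: str) -> float:
    """LCS length normalized by the subject length (assumed
    normalization; not stated explicitly in the translated text)."""
    return lcs_length(subject, body) / len(subject) if subject else 0.0
```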
Association attribute two: the number of times the subject's keywords appear in the text content.
The more times the subject's keywords appear in the text content, the better the subject is considered to match the text content, i.e., the better the first text matches the second text. This feature may be expressed as follows:
(The original presents this feature as a formula image: the count of occurrences of the subject's keywords in the text content.)
To extract the subject's keywords more accurately, the subject may first be segmented and stripped of stopwords. For example, the jieba tokenizer segments the subject, and then function words with little substantive meaning are removed, leaving the keywords.
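A sketch of keyword extraction and the occurrence count; the stopword list is illustrative, and jieba (which the Chinese setting assumes) is used only if it is available so the example stays self-contained:

```python
STOPWORDS = {"的", "是", "了", "the", "is", "a"}  # illustrative stopword list

def extract_keywords(subject: str) -> list:
    """Segment the subject and drop stopwords. Real deployments would
    use jieba for Chinese; whitespace splitting is the fallback here."""
    try:
        import jieba
        tokens = jieba.lcut(subject)
    except ImportError:
        tokens = subject.split()
    return [t for t in tokens if t.strip() and t not in STOPWORDS]

def keyword_hits(subject: str, body: str) -> int:
    """Total occurrences of the subject's keywords in the body text."""
    return sum(body.count(k) for k in extract_keywords(subject))
```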
The embodiment of the present application does not specifically limit the content of the association attribute; for example, it may also be the genre of the subject and the structure of the content. For instance, argumentative writing adopts the "total-sub-total" structure, while lyrical prose is "scattered in form but not in spirit".
S402: determining input parameters of a second regression model according to the association attributes and the similarity, and determining the matching degree of the first text and the second text through the second regression model.
The number of association attributes may be one or more. After the association attributes and the similarity are obtained, the input parameters of the second regression model are determined from them, and the matching degree of the first text and the second text is obtained through the second regression model.
The kind of the second regression model is not specifically limited in the embodiments of the present application, and those skilled in the art can select the second regression model according to actual needs.
The following takes a logistic regression model as the second regression model by way of example.
The at least one association attribute and the similarity are input into the logistic regression model, which may be as follows:
y = 1 / (1 + e^(-θᵀx))
where x is the input data of the model, y is the output data of the model, and θ is the model's weight vector, trained by gradient descent on the deviation between the predicted and true values of the sample data.
This function is an S-shaped curve that maps output values into [0, 1] and approaches 0 or 1 quickly as the input moves away from zero. Therefore 0.5 can be used as a threshold: when y is greater than 0.5, the first text matches the second text; when y is less than 0.5, the first text does not match the second text.
Using a logistic regression function to decide whether the first text matches the second text requires very little computation, is fast, and saves storage resources.
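Under the stated assumptions, the whole second-stage decision can be sketched as follows (the θ values in the example are made up for illustration, not learned weights):

```python
import math

def logistic_match(features, theta):
    """Second-stage logistic regression over [similarity, assoc_attr_1, ...].
    A leading 1.0 is prepended as the bias term; theta holds the weights."""
    z = sum(w * x for w, x in zip(theta, [1.0] + list(features)))
    y = 1.0 / (1.0 + math.exp(-z))       # matching degree in (0, 1)
    return y, y > 0.5                    # 0.5 is the decision threshold

# similarity 0.9, LCS ratio 0.6, keyword hits 3, illustrative weights
degree, matched = logistic_match([0.9, 0.6, 3], [-2.0, 2.0, 1.0, 0.1])
```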
The following describes, with reference to fig. 5, the process of applying the text matching method to determine whether a composition goes off topic.
Referring to fig. 5, fig. 5 is a schematic diagram of a method for determining whether a composition goes off topic.
In determining whether a composition goes off topic, one generally judges whether its content is written around the composition title; the first text and the second text are then the composition title and the composition content, respectively. The similarity of title and content is obtained through the first regression model, and the association attributes of the composition are obtained as well. In this embodiment, two association attributes are assumed, namely the longest common subsequence ratio of the title to the content, and the number of times the title's keywords appear in the content. The similarity and the two association attributes are input into the second regression model to obtain a detection result of whether the composition goes off topic.

When the text matching method is applied to composition grading, that is, when the target text is a composition to be graded, the score depends not only on whether the composition goes off topic but also on its chapter structure. For example, when the target genre is argumentative writing, a composition whose chapter structure conforms to the "total-sub-total" pattern scores higher. Specifically, the content of each natural paragraph of the body is matched against the title, yielding a matching degree between the first text and each second text; that is, each second text is the text content of one natural paragraph of the body.
Paragraphs with high matching degree are the "total" paragraphs and those with low matching degree are the "sub" paragraphs. If the matching degrees show that the body of the target text conforms to the "total-sub-total" structure, its chapter structure fits argumentative writing and, to some extent, earns a higher score.
The score is also related to how tightly the composition is organized: tighter organization means stronger cohesion and a higher score. Cohesion can be measured by whether adjacent paragraphs match; the higher the matching degree of adjacent paragraphs, the stronger the composition's cohesion. Specifically, a second text and a third text are first obtained, where the second text is the content of one natural paragraph of the target text's body and the third text is the content of one of the two natural paragraphs adjacent to the second text. Then the input data of the first regression model is determined from the second and third texts, and their similarity is determined through the first regression model. Finally, the matching degree of the second and third texts is determined from the similarity.
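The adjacency check can be sketched as follows; overlap_similarity is only a stand-in for the BERT-based first regression model, so that the sketch stays self-contained:

```python
def paragraph_cohesion(paragraphs, similarity_fn):
    """Score every pair of adjacent paragraphs with the supplied
    similarity function; higher adjacent-pair similarity is read
    as stronger cohesion."""
    return [similarity_fn(paragraphs[i], paragraphs[i + 1])
            for i in range(len(paragraphs) - 1)]

def overlap_similarity(a, b):
    """Illustrative stand-in (Jaccard word overlap), not the model."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```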
Next, the text matching method provided in the embodiment of the present application is described in a practical application scenario: composition grading. An important criterion in scoring is whether the composition goes off topic; if it does, it directly receives a low score.
Referring to fig. 6, fig. 6 is a schematic diagram of a composition grading result, where the composition score combines the composition's central theme, structure, content, expression, and so on. These are described separately below.
First, whether the composition stays on its central theme is obtained by judging whether it goes off topic. After the composition's title and body are obtained, the input parameters of the first regression model are determined from them, the similarity of title and body is obtained through the first regression model, and the matching degree is derived from that similarity. The higher the matching degree, the lower the likelihood that the composition is off topic and the relatively higher its score; the lower the matching degree, the higher the likelihood of going off topic and the lower the score.
Further, to improve the accuracy of the off-topic judgment, the composition can be evaluated with multi-dimensional features comprising the similarity and at least one association attribute; the more association attributes, the more comprehensively the judgment is made. The association attributes may be, for example, the longest common subsequence ratio of title to content and the number of times the title's keywords appear in the content. The input parameters of the second regression model are determined from these multi-dimensional features, and the matching degree of the composition title and body is determined through the second regression model.
Second, whether the composition's structure fits its genre can be judged from its chapter structure. For example, for argumentative writing, a composition whose chapter structure conforms to the "total-sub-total" pattern scores higher. Specifically, the content of each natural paragraph of the body is matched against the composition title to obtain a matching degree per paragraph; paragraphs with high matching degree are "total" paragraphs and those with low matching degree are "sub" paragraphs.
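One possible way to test the per-paragraph matching degrees against the "total-sub-total" shape; the threshold and the exact decision rule here are assumptions for illustration, not specified by the original:

```python
def is_total_sub_total(match_degrees, threshold=0.5):
    """Return True when the opening and closing paragraphs match the
    title strongly ("total") while every middle paragraph matches it
    more weakly ("sub"). `threshold` is an assumed cut-off."""
    if len(match_degrees) < 3:
        return False
    head, *middle, tail = match_degrees
    return (head >= threshold and tail >= threshold
            and all(m < threshold for m in middle))
```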
Then, whether the composition is tightly organized can be judged from its cohesion; stronger cohesion earns a higher score. Cohesion is reflected in whether adjacent paragraphs match: the higher their matching degree, the stronger the cohesion. Specifically, the similarity of every pair of adjacent natural paragraphs is obtained through the first regression model, and their matching degree is derived from that similarity.
Finally, whether the composition's expression is fluent and full can be analyzed from the continuity of the words in the body. Specifically, the similarity between words is obtained through the first regression model and the matching degree between them derived from it; the higher the matching degree, the stronger the continuity between words.
In the technical solution above, after the first text and the second text are obtained, the matching result is produced by a regression model rather than a classification model: the first regression model determines the similarity of the two texts, and the matching degree is derived from that similarity. Even when the two texts differ greatly in content and length, the regression model outputs a real number in, for example, the interval from 0 to 1 rather than a binary 0-or-1 result, and continuous data reflects the matching degree more precisely than discrete data. The regression model's loss function is also smoother than a classification model's, and its outputs spread out instead of piling up in a fixed interval, which increases the discrimination of the similarities and makes the resulting matching degree more accurate.
Based on the text matching method provided in the foregoing embodiment, an embodiment of the present application further provides a text matching apparatus 600, referring to fig. 7, where fig. 7 is a structural block diagram of the text matching apparatus provided in the embodiment of the present application, and the apparatus 600 includes an obtaining unit 601, a similarity unit 602, and a matching degree unit 603;
an obtaining unit 601, configured to obtain a first text and a second text, where the first text is a subject of a target text, and the second text is determined according to a body of the target text; the length difference between the first text and the second text is larger than a threshold value;
a similarity unit 602, configured to determine input data of a first regression model according to the first text and the second text, and determine a similarity between the first text and the second text through the first regression model;
a matching degree unit 603, configured to determine a matching degree between the first text and the second text according to the similarity.
In a possible implementation manner, the similarity unit 602 is further configured to input the first text and the second text into the first regression model, extract feature vectors through the first regression model, and determine the similarity of the first text and the second text.
In a possible implementation manner, the matching degree unit 603 is further configured to determine an association attribute of the target text according to text contents included in the subject and the body, where the association attribute is used to represent an association degree between the subject and the body in a text dimension;
and to determine input parameters of a second regression model according to the association attributes and the similarity, and determine the matching degree of the first text and the second text through the second regression model.
In a possible implementation manner, the obtaining unit 601 is further configured to obtain the second text and a third text, where the second text is the text content of a natural paragraph contained in the body, and the third text is the text content of one of the natural paragraphs adjacent to the second text;
the similarity unit 602 is further configured to determine input data of a first regression model according to the second text and the third text, and determine a similarity between the second text and the third text through the first regression model;
the matching degree unit 603 is further configured to determine a matching degree between the second text and the third text according to the similarity.
In the technical solution above, after the obtaining unit obtains the first text and the second text, the similarity unit determines their similarity through the first regression model, and the matching degree unit derives the matching degree from that similarity. Even when the two texts differ greatly in content and length, the regression model outputs a real number in, for example, the interval from 0 to 1 rather than a binary 0-or-1 result, and continuous data reflects the matching degree more precisely than discrete data. The regression model's loss function is also smoother than a classification model's, and its outputs spread out instead of piling up in a fixed interval, increasing the discrimination of the similarities and making the resulting matching degree more accurate.
The embodiment of the present application further provides a device for text matching, described below with reference to the accompanying drawings. Referring to fig. 8, the device may be a terminal device, which may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, or a vehicle-mounted computer; the following takes a mobile phone as an example:
fig. 8 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 8, the handset includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (WiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 8:
The RF circuit 710 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and passes it to the processor 780 for processing, and transmits uplink data to the base station. In general, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 710 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 may execute various functional applications and data processing of the cellular phone by operating the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, can collect touch operations of a user (e.g. operations of the user on or near the touch panel 731 by using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 731 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 780, and can receive and execute commands from the processor 780. In addition, the touch panel 731 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 740 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The Display unit 740 may include a Display panel 741, and optionally, the Display panel 741 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-Emitting Diode (OLED), or the like. Further, the touch panel 731 can cover the display panel 741, and when the touch panel 731 detects a touch operation on or near the touch panel 731, the touch operation is transmitted to the processor 780 to determine the type of the touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in fig. 8, the touch panel 731 and the display panel 741 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 750, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 741 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 741 and/or a backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuitry 760, speaker 761, and microphone 762 may provide an audio interface between a user and the mobile phone. The audio circuit 760 can transmit the electrical signal converted from received audio data to the speaker 761, which converts it into a sound signal for output. On the other hand, the microphone 762 converts collected sound signals into electrical signals, which the audio circuit 760 receives and converts into audio data; the audio data is then output to the processor 780 for processing and subsequently transmitted, for example, to another mobile phone via the RF circuit 710, or output to the memory 720 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 770, and provides wireless broadband Internet access for the user. Although fig. 8 shows the WiFi module 770, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 780 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby integrally monitoring the mobile phone. Optionally, processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 780.
The handset also includes a power supply 790 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 780 via a power management system, so that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring a first text and a second text, wherein the first text is the subject of the target text, and the second text is determined according to the body of the target text; the length difference between the first text and the second text is greater than a threshold value;
determining input data of a first regression model according to the first text and the second text, and determining the similarity of the first text and the second text through the first regression model;
and determining the matching degree of the first text and the second text according to the similarity.
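The three functions listed above can be illustrated with a minimal, runnable sketch. This is not the patent's implementation: `similarity` below is a simple bag-of-words cosine standing in for the BERT-based first regression model, `lcs_ratio` computes the longest-common-subsequence association attribute mentioned later in the claims, and `match_degree` uses a fixed weighting in place of the second regression model. All function names and weights are hypothetical.

```python
# Sketch of the subject/body matching pipeline: acquire two texts, score
# their similarity, and derive a matching degree. The cosine similarity and
# the fixed 0.7/0.3 weighting are lightweight stand-ins for the two
# regression models described in the embodiment.
import math
from collections import Counter

def similarity(first_text: str, second_text: str) -> float:
    """Stand-in for the first regression model: cosine over word counts."""
    a, b = Counter(first_text.split()), Counter(second_text.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lcs_ratio(subject: str, body: str) -> float:
    """Association attribute: longest-common-subsequence length over the
    subject length, computed on word sequences by standard DP."""
    s, t = subject.split(), body.split()
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, sw in enumerate(s):
        for j, tw in enumerate(t):
            dp[i + 1][j + 1] = dp[i][j] + 1 if sw == tw \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(s)][len(t)] / len(s) if s else 0.0

def match_degree(subject: str, body: str) -> float:
    """Combine similarity and the association attribute (a fixed weighting
    standing in for the second regression model)."""
    return 0.7 * similarity(subject, body) + 0.3 * lcs_ratio(subject, body)

if __name__ == "__main__":
    print(round(match_degree("my summer vacation",
                             "during my summer vacation I went camping"), 3))
```

In the embodiment itself, the similarity score would come from the fine-tuned BERT-based regression model rather than this word-count cosine; only the control flow is meant to match the three steps above.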
Referring to fig. 9, fig. 9 is a block diagram of a server 800 provided in this embodiment. The server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 842 or data 844. The memory 832 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any implementation of the text matching method described in the foregoing embodiments.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that the embodiments in this specification are described in a progressive manner; identical and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for related points, reference may be made to the descriptions of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
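The first regression model of this application (see claim 2 below) is a variant of BERT whose loss function is mean square error and whose output activation is a logistic Sigmoid. The following stand-alone sketch shows only that output head, not the patent's model: the BERT encoder is replaced by two hand-crafted features per text pair (an assumption for illustration), and the weights are fitted by plain gradient descent on the MSE loss through the Sigmoid.

```python
# Sketch of a similarity regression head: logistic Sigmoid activation
# trained under a mean-square-error loss, as claim 2 describes for the
# first regression model. Features, names, and hyperparameters here are
# illustrative stand-ins for BERT-derived representations.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.5, epochs=2000):
    """Fit weights by SGD on the MSE loss (p - y)^2, p = sigmoid(w.x + b)."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = 2.0 * (p - y) * p * (1.0 - p)  # dMSE/dz through the sigmoid
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Similarity score in (0, 1) for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

if __name__ == "__main__":
    # toy pairs: features = (token-overlap ratio, length ratio)
    feats = [(0.9, 0.8), (0.1, 0.2), (0.8, 0.9), (0.2, 0.1)]
    labels = [1.0, 0.0, 1.0, 0.0]
    w, b = train(feats, labels)
    print(round(predict(w, b, (0.85, 0.85)), 2))
```

The Sigmoid keeps the predicted similarity in (0, 1), which is what lets the MSE loss regress directly against similarity labels in that range.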

Claims (10)

1. A text matching method based on artificial intelligence is characterized by comprising the following steps:
acquiring a first text and a second text, wherein the first text is a subject of a target text, and the second text is determined according to a body of the target text; the length difference between the first text and the second text is greater than a threshold value;
determining input data of a first regression model according to the first text and the second text, and determining the similarity of the first text and the second text through the first regression model;
and determining the matching degree of the first text and the second text according to the similarity.
2. The method of claim 1, wherein the first regression model is a regression model determined based on a variant of the Bidirectional Encoder Representations from Transformers (BERT) model; wherein the loss function of the regression model is a mean square error function, and the activation function is a logistic Sigmoid function.
3. The method of claim 2, wherein determining input data of a first regression model according to the first text and the second text, and determining similarity between the first text and the second text through the first regression model comprises:
inputting the first text and the second text into the first regression model, extracting a feature vector through the first regression model, and determining the similarity of the first text and the second text.
4. The method according to any one of claims 1-3, further comprising:
determining the association attribute of the target text according to the text content contained in the subject and the body, wherein the association attribute is used to represent the degree of association between the subject and the body in the text dimension;
the determining the matching degree of the first text and the second text according to the similarity comprises:
and determining input parameters of a second regression model according to the correlation attributes and the similarity, and determining the matching degree of the first text and the second text through the second regression model.
5. The method of claim 4, wherein the association attribute comprises at least one of a longest common subsequence ratio of the subject to the text content, and a number of times a keyword of the subject appears in the text content.
6. The method according to claim 1, wherein when the target text is a composition to be corrected, the second text is the text content of at least one natural segment included in the body.
7. The method according to claim 1, wherein when the target text is a composition to be corrected, the method further comprises:
acquiring the second text and a third text, wherein the second text is the text content of a natural segment contained in the body, and the third text is the text content of one of the natural segments adjacent to the second text;
determining input data of a first regression model according to the second text and the third text, and determining the similarity of the second text and the third text through the first regression model;
and determining the matching degree of the second text and the third text according to the similarity.
8. An artificial intelligence based text matching apparatus, comprising: the device comprises an acquisition unit, a similarity unit and a matching degree unit;
the acquiring unit is used for acquiring a first text and a second text, wherein the first text is a subject of a target text, and the second text is determined according to a body of the target text; the length difference between the first text and the second text is greater than a threshold value;
the similarity unit is used for determining input data of a first regression model according to the first text and the second text, and determining the similarity between the first text and the second text through the first regression model;
and the matching degree unit is used for determining the matching degree of the first text and the second text according to the similarity.
9. An apparatus for text matching, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-7 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any one of claims 1-7.
CN202010675118.XA 2020-07-14 2020-07-14 Text matching method and related device based on artificial intelligence Active CN112749252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675118.XA CN112749252B (en) 2020-07-14 2020-07-14 Text matching method and related device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN112749252A true CN112749252A (en) 2021-05-04
CN112749252B CN112749252B (en) 2023-11-03

Family

ID=75645233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010675118.XA Active CN112749252B (en) 2020-07-14 2020-07-14 Text matching method and related device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN112749252B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188091A (en) * 2023-05-04 2023-05-30 品茗科技股份有限公司 Method, device, equipment and medium for automatic matching unit price reference of cost list
CN116842960A (en) * 2023-05-31 2023-10-03 海信集团控股股份有限公司 Feature extraction model training and extracting method and device
CN117689998A (en) * 2024-01-31 2024-03-12 数据空间研究院 Nonparametric adaptive emotion recognition model, method, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
JP2018036725A (en) * 2016-08-29 2018-03-08 日本電信電話株式会社 Matching determination apparatus, method, and program
CN109947917A * 2019-03-07 2019-06-28 北京九狐时代智能科技有限公司 Sentence similarity determination method and apparatus, electronic device, and readable storage medium
CN110110035A (en) * 2018-01-24 2019-08-09 北京京东尚科信息技术有限公司 Data processing method and device and computer readable storage medium
CN110413730A (en) * 2019-06-27 2019-11-05 平安科技(深圳)有限公司 Text information matching degree detection method, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
KR102646667B1 (en) Methods for finding image regions, model training methods, and related devices
CN109145303B (en) Named entity recognition method, device, medium and equipment
CN107943860B (en) Model training method, text intention recognition method and text intention recognition device
CN110704661B (en) Image classification method and device
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110634474B (en) Speech recognition method and device based on artificial intelligence
CN112749252B (en) Text matching method and related device based on artificial intelligence
CN111931501A (en) Text mining method based on artificial intelligence, related device and equipment
CN111651604B (en) Emotion classification method and related device based on artificial intelligence
CN111539212A (en) Text information processing method and device, storage medium and electronic equipment
CN112214605A (en) Text classification method and related device
CN111597804B (en) Method and related device for training entity recognition model
CN113761122A (en) Event extraction method, related device, equipment and storage medium
CN113269279B (en) Multimedia content classification method and related device
CN113822038A (en) Abstract generation method and related device
CN114328908A (en) Question and answer sentence quality inspection method and device and related products
CN112307198B (en) Method and related device for determining abstract of single text
CN111723783B (en) Content identification method and related device
CN113569043A (en) Text category determination method and related device
CN113703883A (en) Interaction method and related device
CN115080840A (en) Content pushing method and device and storage medium
CN116259083A (en) Image quality recognition model determining method and related device
CN112328783A (en) Abstract determining method and related device
CN113505596A (en) Topic switching marking method and device and computer equipment
CN117011649B (en) Model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant