CN112131345A - Text quality identification method, device, equipment and storage medium - Google Patents

Text quality identification method, device, equipment and storage medium Download PDF

Info

Publication number
CN112131345A
CN112131345A (application CN202011003717.3A)
Authority
CN
China
Prior art keywords
text
vector
quality
sample
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011003717.3A
Other languages
Chinese (zh)
Other versions
CN112131345B (en)
Inventor
朱灵子
衡阵
马连洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011003717.3A priority Critical patent/CN112131345B/en
Publication of CN112131345A publication Critical patent/CN112131345A/en
Application granted granted Critical
Publication of CN112131345B publication Critical patent/CN112131345B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/33 — Querying
    • G06F16/3331 — Query processing
    • G06F16/334 — Query execution
    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/35 — Clustering; Classification
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods
    • G06N3/084 — Backpropagation, e.g. using gradient descent

Abstract

The application discloses a text quality identification method, apparatus, device, and storage medium, relating to the field of deep learning. The method comprises the following steps: acquiring a text vector of a text, where the text vector comprises at least one of a title text vector and a body text vector, the title text vector being the vector corresponding to the title of the text and the body text vector being the vector corresponding to the body of the text; obtaining a graph vector corresponding to the keywords in the text, where the graph vector is obtained by applying graph embedding to the keywords; classifying the title text vector, the body text vector, and the graph vector to obtain a quality grade prediction probability for the text; and dividing the text into quality grades according to the quality grade prediction probability. By converting the text into a co-occurrence relationship structure graph and applying graph embedding to that graph with a random walk algorithm, the method improves the ability of the text quality recognition model to recognize text quality.

Description

Text quality identification method, device, equipment and storage medium
Technical Field
The present application relates to the field of deep learning, and in particular, to a method, an apparatus, a device, and a storage medium for text quality recognition.
Background
With the development of internet technology, the amount of information received by users has grown exponentially, and information publishing platforms increase their traffic and readership by publishing high-quality articles.
When an article is published on an information publishing platform, it is usually accompanied by pictures, and combining pictures with text improves the article's quality. Illustratively, an information publishing platform usually builds its own article quality evaluation system, which evaluates an article from two aspects: objective prior experience (including at least one of article layout, clarity of the accompanying pictures, aesthetics, and the degree to which the pictures match the article content) and the quality of the text content. In the related art, text quality is determined by recognizing the basic content of the text with a text quality recognition model trained through supervised learning, for example a Bidirectional Encoder Representations from Transformers (BERT) model, which judges the quality of the text from its basic content.
In this technical scheme, the text quality recognition model judges quality only from the basic content of the text. The recognition dimension is single, and the accuracy of the model is low when the overall quality of the text is recognized.
Disclosure of Invention
The embodiments of the application provide a text quality identification method, apparatus, device, and storage medium. The text is converted into a co-occurrence relationship structure graph, and graph embedding is applied to this graph using a random walk algorithm, which improves the ability of the text quality recognition model to recognize text quality. The technical scheme comprises the following steps:
according to an aspect of the present application, there is provided a text quality recognition method, including:
acquiring a text vector of a text, wherein the text vector of the text at least comprises one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text;
obtaining a graph vector corresponding to the keywords in the text, wherein the graph vector is obtained by applying graph embedding to the keywords;
classifying the text vector and the graph vector of the text to obtain the quality grade prediction probability corresponding to the text;
and dividing the text into quality grades according to the quality grade prediction probability.
According to another aspect of the present application, there is provided a text quality recognition apparatus, including:
the acquisition module is used for acquiring a text vector of a text, wherein the text vector of the text at least comprises one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text;
the acquisition module is further used for acquiring a graph vector corresponding to the keywords in the text, wherein the graph vector is obtained by applying graph embedding to the keywords;
the classification module is used for classifying the text vector and the graph vector of the text to obtain the quality grade prediction probability corresponding to the text;
and the quality classification module is used for dividing the text into quality grades according to the quality grade prediction probability.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method of text quality recognition as described in the above aspect.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement the text quality recognition method according to the above aspect.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and the processor executes the computer instructions to cause the computer device to perform the text quality recognition method as described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the image vectors are obtained by using the keywords in the text, the incidence relation among the keywords in the text is represented by the image vectors, the central thought of the text can be deeply understood by using the incidence relation among the keywords, so that the content of the text is accurately judged, and then the text vectors (such as the characteristics of multiple dimensions of the title text vector and the body text vector) of the text are combined, so that the text quality identification model can accurately identify the overall quality of the text, and further platforms such as a subsequent application program and the like can recommend high-quality texts to a user.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a flow diagram of a method for text quality recognition provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of text quality recognition provided by another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a path of a random walk provided by an exemplary embodiment of the present application;
FIG. 5 is a block diagram of a word skipping model provided by an exemplary embodiment of the present application;
FIG. 6 is a block diagram of a text quality recognition model provided by an exemplary embodiment of the present application;
fig. 7 is a block diagram of a text quality recognition apparatus according to an exemplary embodiment of the present application;
fig. 8 is a schematic device structure diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms related to embodiments of the present application will be described.
DeepWalk model: a model implemented based on the DeepWalk algorithm. The input is a graph or network, and the output is a vector representation of each vertex in the network. DeepWalk learns feature representations of the network through truncated random walks, and can achieve good results even when few vertices in the network are labeled. The algorithm is also scalable and can adapt to changes in the network.
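As a minimal sketch (over a toy adjacency-list graph of keywords; not the patent's actual implementation), the truncated random walks that DeepWalk relies on can be generated as follows — each walk is then treated as a "sentence" of vertices and fed to a skip-gram model:

```python
import random

def truncated_random_walks(graph, num_walks, walk_length, seed=0):
    """Generate truncated random walks over an adjacency-list graph.

    graph: dict mapping each vertex to a list of neighbour vertices.
    Returns a list of walks; a walk is truncated early if it reaches a
    vertex with no outgoing neighbours.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph:            # start one walk from every vertex
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:     # dead end: truncate the walk
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy keyword graph: vertices are keywords, edges are co-occurrences.
toy_graph = {
    "text": ["quality", "vector"],
    "quality": ["text", "vector"],
    "vector": ["text", "quality"],
}
walks = truncated_random_walks(toy_graph, num_walks=2, walk_length=4)
```

Each vertex's walks serve as its local context, which is what lets the subsequent skip-gram step produce similar vectors for vertices that co-occur on walks.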
Co-occurrence relationship: the relationship between target words, determined by analyzing cases where at least two target words appear in the same text. The association between target words is usually determined by the number of times they appear together in the same article; word frequency analysis or cluster analysis can reveal the relationships between words and thus better determine the topic of the article. For example, the characters appearing in a novel can be extracted, and the relationship between two characters can be determined from the number of times they appear together.
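A minimal sketch of how such co-occurrence counts can be computed, assuming whole documents as the co-occurrence window (the `count_cooccurrences` helper is illustrative, not taken from the patent):

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(documents, targets):
    """Count how often each pair of target words appears in the same document."""
    counts = Counter()
    for doc in documents:
        present = sorted(set(doc) & set(targets))  # targets found in this doc
        for pair in combinations(present, 2):      # every unordered pair
            counts[pair] += 1
    return counts

docs = [
    ["alice", "meets", "bob"],
    ["alice", "writes", "to", "bob"],
    ["carol", "calls", "bob"],
]
counts = count_cooccurrences(docs, targets={"alice", "bob", "carol"})
```

In the novel example above, the pair count for two characters is exactly this statistic: "alice" and "bob" share two documents, so their association is strongest.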
Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures, thereby continuously improving their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart medical care, smart customer service, intelligent text auditing, and intelligent identification of text quality.
The text quality identification method provided by the embodiments of the application can be applied to computer devices with strong data processing capability. In one possible implementation, the method can be applied to a personal computer, workstation, or server; that is, the training or use of the text quality recognition model can be implemented on a personal computer, workstation, or server.
The trained text quality recognition model can be implemented as part of an application installed in a terminal, so that the application can recognize text quality and push higher-quality texts to the user; alternatively, the trained text quality recognition model can be deployed in a background server of the application, so that a terminal with the application installed can recognize text quality through the background server.
FIG. 1 illustrates a schematic diagram of a computer system provided by an exemplary embodiment of the present application. The computer system 100 includes a terminal 110 and a server 120, wherein the terminal 110 and the server 120 perform data communication through a communication network, optionally, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 110 is installed with an application program supporting a text reading function, where the application program may be any one of a browser application program, a news application program, a social contact application program, a web question and answer sharing application program, and a text reading application program, and this is not limited in this embodiment of the present application.
Optionally, the terminal 110 may be a mobile terminal such as a smart phone, smart watch, tablet computer, electronic book reader, laptop computer, or intelligent robot, or a terminal such as a desktop computer or projection computer; the type of terminal is not limited in the embodiments of the application.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. In one possible implementation, the server 120 is a backend server for applications in the terminal 110.
As shown in fig. 1, in the present embodiment, a browser application is installed and operated in the terminal 110, a recommended article list is displayed on the browser application, and the user browses favorite articles through the browser application. The articles in the recommended article list are screened by the browser application through the text quality recognition model in the server 120, and the server 120 sends the screened articles to the terminal 110, so that the recommended article list is displayed on the browser application.
Illustratively, various articles are stored in the server 120, when a user reads the articles by using a browser application program on the terminal 110, the terminal 110 sends a user account to the server 120, the server 120 selects high-quality articles from the stored articles for the user according to the user account to form a recommended article list, and the recommended article list is sent to the terminal 110.
The server 120 comprises a text quality recognition model 10. After the server 120 obtains the text 11, it extracts the keywords 12 from the text 11, determines the extracted keywords 12 as graph nodes, and constructs a co-occurrence relationship structure diagram 13 from the graph nodes. The co-occurrence relationship structure diagram 13 is input to the DeepWalk model 14, graph embedding is performed on it, and a graph vector 15 corresponding to the keywords is output. The graph vector 15 and the text vector 16 of the text are input to the text quality recognition model 10, which outputs the quality level prediction probability 17 of the text. The server 120 grades the articles according to the quality level prediction probability and sends the high-quality articles to the terminal 110 to form the article recommendation list.
The text vector 16 comprises at least one of a headline text vector and a body text vector, where the headline text vector corresponds to the headline of the article and the body text vector corresponds to the body of the article.
For convenience of description, the following embodiments are described as examples in which the recognition method of the text quality is performed by the server.
Fig. 2 shows a flowchart of a text quality recognition method provided by an exemplary embodiment of the present application. The embodiment is described by taking the method as an example for being used in the server 120 in the computer system 100 shown in fig. 1, and the method includes the following steps:
step 201, obtaining a text vector of a text, where the text vector of the text at least includes one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text.
The text comprises at least one of articles, news reports, poems, novels and shared information in a social sharing application program, and the type of the text is not limited in the embodiment of the application.
When the text quality is identified, the quality of the text can be more accurately judged by comprehensively judging a plurality of dimensions of the text. Illustratively, the server includes a corpus in which various types of text are stored. It is understood that the text includes at least one of a title and a body, and the text vector obtained from the text also includes at least one of a title text vector and a body text vector. The embodiments of the present application are described by taking text including a title and a body as examples.
Illustratively, the server extracts a body text vector and a title text vector from the body and title of the text, respectively. Illustratively, a fast text classification model (FastText model) is used to process the text to obtain the body text vector and the title text vector. The FastText model is an open-source word vector and text classification tool that can generate word vector representations of a text in an unsupervised manner. For example, the FastText model can learn that "boy" and "girl" refer to a particular gender and can associate these vectors with related documents, so that when a user issues a query in an application built with the FastText model (for example, "where is my bag now"), related documents can be matched even when they do not contain the exact query words. Compared with neural-network-based classification algorithms, the FastText model speeds up training and testing while maintaining high accuracy, and it does not require pre-trained word vectors, since it can train word vectors itself.
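One reason fastText can handle words it has not seen is its use of character n-grams with word-boundary markers; the sketch below is a simplification of that idea, not the actual fastText implementation:

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return the character n-grams of a word, fastText-style, with <,> boundary marks."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# A word's vector can then be formed by combining the vectors of its n-grams,
# so morphologically related words ("bag", "bags") share most components.
grams = char_ngrams("bag")
```

Because "bag" and "bags" share n-grams such as "<ba" and "bag", their vectors end up close even if one of them is rare in the training corpus.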
It can be understood that the text may further include content such as tags and comments, so when the text vector of the text is obtained, a vector corresponding to the tags and a vector corresponding to the comments may also be obtained. For example, if an article published on an official account includes comment content from multiple users, a text vector corresponding to the comment content can be extracted from it.
In some embodiments, the heading text vector and the body text vector may also be extracted from the text by other text vector extraction models.
Step 202, obtaining a graph vector corresponding to the keyword in the text, wherein the graph vector is obtained after the keyword is subjected to graph embedding processing.
The graph vector is a vector obtained through graph embedding. Keywords are extracted from the text, and the graph vector of the text is generated from these keywords; that is, graph embedding is applied to the text to obtain the graph vector corresponding to its keywords.
Before the server acquires the graph vector corresponding to the keywords in the text, it needs to construct a structure diagram corresponding to the text: the keywords of the text serve as the nodes (Node) of the diagram, and the co-occurrence relationships among the keywords serve as its relationship edges, forming the co-occurrence structure diagram of the keywords.
Schematically, graph embedding is performed on the co-occurrence relationship structure diagram of the text through the DeepWalk model; that is, through graph embedding, words in the text are mapped into word vectors that a computer can recognize.
And step 203, classifying the text vector and the graph vector of the text to obtain the quality grade prediction probability corresponding to the text.
Illustratively, a text quality recognition model is constructed in the server. The text quality recognition model is a machine learning model with text recognition capability: it judges the quality of the text from the content the text represents and has the ability to predict and judge text quality. The server calls the text quality recognition model to classify the text vector and graph vector of the text, and the model outputs the quality level prediction probability of the text. In other embodiments, the text vector comprises a title text vector and a body text vector; the server calls the text quality recognition model to classify the title text vector, the body text vector, and the graph vector, and the model outputs the quality level prediction probability of the text.
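Concretely, the classification step can be pictured as concatenating the vectors and applying a linear layer followed by a softmax. The sketch below uses toy dimensions and made-up weights (the patent does not give the model's parameters); it only shows the shape of the computation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def classify(title_vec, body_vec, graph_vec, weights, bias):
    """Concatenate the three vectors and score each quality class."""
    features = title_vec + body_vec + graph_vec
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)  # quality level prediction probabilities

# Toy 2-dim vectors, two quality classes (high / low), made-up weights.
probs = classify(
    title_vec=[0.2, 0.1], body_vec=[0.4, 0.3], graph_vec=[0.1, 0.5],
    weights=[[1, 1, 1, 1, 1, 1], [-1, -1, -1, -1, -1, -1]],
    bias=[0.0, 0.0],
)
```

In a trained model the weights would of course be learned, e.g. by backpropagation as in the G06N3/084 classification above.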
And step 204, dividing the quality grade of the text quality according to the quality grade prediction probability.
The quality level prediction probability serves as a rating label for dividing the quality level of the text. Illustratively, based on the quality level prediction probability, the quality of the text is divided into two levels, high quality and low quality. In some embodiments, the quality of the text is divided into premium and non-premium; in other embodiments, it is divided into A, B, and C levels, where A indicates the best quality, B indicates medium quality, and C indicates lower quality. The embodiments of the present application do not limit the specific dividing manner.
Illustratively, a probability threshold is set for the quality level prediction probability, the text with the quality level prediction probability higher than the probability threshold is divided into high-quality text, and the text with the quality level prediction probability lower than the probability threshold is divided into low-quality text.
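The threshold rule above amounts to a one-line decision; a minimal sketch (the threshold value and the boundary behavior at exactly the threshold are illustrative choices, not specified by the patent):

```python
def grade(prediction_probability, threshold=0.5):
    """Map a quality level prediction probability to a quality grade.

    Probabilities at or above the threshold count as high quality here;
    the patent leaves the boundary case unspecified.
    """
    return "high-quality" if prediction_probability >= threshold else "low-quality"

labels = [grade(p) for p in (0.92, 0.31, 0.5)]
```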
In summary, in the method provided by this embodiment, a graph vector is obtained from the keywords in the text, and the association relationships among the keywords are represented by the graph vector. Using these associations, the central idea of the text can be understood in depth, so that the content of the text is judged accurately. Combined with the text vectors of the text (for example, features of multiple dimensions such as the title text vector and the body text vector), the text quality recognition model can then accurately identify the overall quality of the text, so that platforms such as downstream applications can recommend high-quality texts to users.
Fig. 3 shows a flow chart of a text quality recognition method provided by another exemplary embodiment of the present application. The embodiment is described by taking the method as an example for being used in the server 120 in the computer system 100 shown in fig. 1, and the method includes the following steps:
step 301, obtaining a text vector of a text, where the text vector of the text at least includes one of a heading text vector and a body text vector, the heading text vector is a vector corresponding to a heading of the text, and the body text vector is a vector corresponding to a body of the text.
The text comprises at least one of articles, news reports, poems, novels and shared information in a social sharing application program, and the type of the text is not limited in the embodiment of the application.
When the text quality is identified, the quality of the text can be more accurately judged by comprehensively judging a plurality of dimensions of the text. Illustratively, the server includes a corpus in which various types of texts are stored, and it is understood that the texts include at least one of a title and a body, and the text vectors obtained from the texts also include at least one of a title text vector and a body text vector. The embodiments of the present application are described by taking text including a title and a body as examples.
Illustratively, the server extracts a body text vector and a title text vector from the body and title of the text, respectively. Illustratively, a fast text classification model (FastText model) is used to process the text to obtain the body text vector and the title text vector. The FastText model is an open-source word vector and text classification tool that can generate word vector representations of a text in an unsupervised manner. For example, the FastText model can learn that "boy" and "girl" refer to a particular gender and can associate these vectors with related documents, so that when a user issues a query in an application built with the FastText model (for example, "where is my bag now"), related documents can be matched even when they do not contain the exact query words. Compared with neural-network-based classification algorithms, the FastText model speeds up training and testing while maintaining high accuracy, and it does not require pre-trained word vectors, since it can train word vectors itself.
It can be understood that the text may further include content such as a tag and a comment, so when a text vector of the text is obtained, a vector corresponding to the tag and a vector corresponding to the comment may also be obtained. If an article published on the public number includes comment content commented by multiple users, a text vector corresponding to the comment content can be extracted according to the comment content.
In some embodiments, the title text vector and the body text vector may also be extracted from the text by other text vector extraction models; in other embodiments, the title and body of the text are respectively mapped into a title text vector and a body text vector by one-hot encoding; in still other embodiments, the title and body of the text are respectively mapped into a title text vector and a body text vector by word embedding (Word Embedding).
Step 302, extracting keywords from the text.
Illustratively, extracting keywords from the text includes at least one of the following ways: a keyword extraction algorithm based on statistical characteristics, a keyword extraction algorithm based on a word graph model, and a keyword extraction algorithm based on a topic model.
1. The idea of the keyword extraction algorithm based on statistical features is to extract the keywords of a text from the statistical information of its words: the text is generally preprocessed to obtain a set of candidate words, and the keywords are then obtained from the candidate set through feature-value quantization.
2. Keyword extraction based on a word graph model first constructs a language network graph of the text, then analyzes this graph to find the words or phrases that play important roles; these words or phrases are the keywords of the text. Graph nodes in a language network graph are usually words, and according to the different connection relations between the words, language network graphs mainly take four forms: co-occurrence network graphs (co-occurrence relationship structure graphs), syntactic network graphs, semantic network graphs, and other network graphs.
3. The keyword extraction algorithm based on a topic model mainly exploits the topic distribution of the topic model: candidate keywords are first obtained from the text, the topic distribution of the text and of the candidate keywords is calculated, the similarity between the two distributions is computed, and the top n words are selected as keywords.
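The statistics-based approach of item 1 can be sketched with a plain TF-IDF scorer. This is a minimal illustration under assumptions, not the patent's own extractor; the function name `extract_keywords` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def extract_keywords(doc_tokens, corpus_token_lists, top_n=5):
    """Rank candidate words of one document by a TF-IDF score:
    term frequency in the document weighted by inverse document
    frequency across the corpus (a common feature-value quantization)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_token_lists)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus_token_lists if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(doc_tokens)) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

docs = [
    ["graph", "embedding", "keyword", "graph", "vector"],
    ["title", "body", "vector", "quality"],
    ["keyword", "quality", "model"],
]
print(extract_keywords(docs[0], docs, top_n=2))
```

In practice the preprocessing step (tokenization, stop-word removal) would produce the candidate word lists that this scorer ranks.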
Step 303, determining the keyword as a graph node.
In the construction of a language network graph, the preprocessed words are generally used as graph nodes and the relationships between words as edges. The weight of an edge generally expresses the degree of association between the two words it connects. When keywords are obtained from the language network graph, the importance of each graph node is evaluated, the graph nodes are ranked by importance, and the words represented by the top K graph nodes are selected as keywords (K is a positive integer).
Step 304, constructing a co-occurrence relationship structure graph according to the graph nodes, where the co-occurrence relationship structure graph is used to represent the co-occurrence relationships among the keywords in the text.
The embodiments of the present application take a co-occurrence relationship structure graph as an example. Relationship edges in the co-occurrence relationship structure graph are generated according to the co-occurrence relationships of the keywords in the text, and the graph is constructed from the graph nodes and the relationship edges.

A co-occurrence relationship refers to the relationship between at least two target words that occur in the same text. The degree of association between target words is typically determined by the number of times they co-occur in the same article.
Fig. 4 shows a co-occurrence relationship structure diagram provided in an exemplary embodiment of the present application, where circles containing sequence numbers represent graph nodes (nodes) 21 constructed by keywords in text, the graph nodes are connected by relationship edges 22, and the relationship edges 22 are used to represent co-occurrence relationships between the keywords. In some embodiments, the length of the relationship edge indicates the importance of the relationship, for example, a relationship edge between two graph nodes is shorter, indicating that there is a stronger association between the two graph nodes.
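The construction of steps 303 and 304 can be sketched as follows; this is a minimal, hypothetical implementation in which the edge weight is simply the number of articles two keywords share, one common way of quantifying the co-occurrence relationship described above.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(articles_keywords):
    """Build a weighted co-occurrence graph: nodes are keywords, and an
    edge links two keywords that appear in the same article, weighted
    by the number of articles in which they co-occur."""
    edges = defaultdict(int)
    nodes = set()
    for keywords in articles_keywords:
        nodes.update(keywords)
        # Each unordered keyword pair within one article contributes 1.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return nodes, dict(edges)

nodes, edges = build_cooccurrence_graph([
    ["graph", "vector", "quality"],
    ["graph", "quality"],
])
print(edges[("graph", "quality")])  # co-occur in two articles -> 2
```

A stronger association (a higher weight) would correspond to the shorter relationship edges described for fig. 4.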
Step 305, calling the DeepWalk model to perform graph embedding processing on the co-occurrence relationship structure graph, and outputting the graph vectors corresponding to the keywords.
Determine the ith graph node in the co-occurrence relationship structure graph as the root node, where the ith graph node corresponds to the ith keyword in the text and i is a positive integer; acquire n relationship paths originating from the root node, where the relationship paths correspond one to one to relationship edges in the graph and n is a positive integer; select a target relationship path from the n relationship paths and perform random walk processing on it to obtain a group of word sequences corresponding to the target relationship path. These three steps are repeated until all graph nodes in the co-occurrence relationship structure graph have been traversed; graph embedding processing is then performed on the word sequences corresponding to the relationship paths, and the graph vectors corresponding to the keywords are output.
Graph embedding borrows the idea of word embedding (word vectors): the basic processing element of word embedding is a word, and correspondingly, the basic element when graph-embedding the co-occurrence relationship structure graph is a graph node. Where word embedding analyzes the word sequences in a sentence, graph embedding obtains word sequences formed from graph nodes by random walks. A Random Walk repeatedly makes random path selections in the co-occurrence relationship structure graph, finally forming a path that traverses the graph: starting from some graph node, each step randomly selects one of the relationship edges connected to the current node, moves along it to the next node, and repeats the process.
As shown in fig. 4, the walk is designated to start from graph node 3; the edge pointing to graph node 4 is selected, then the edge pointing to graph node 6, and the walk finally reaches graph node 7, so the whole walk is graph node 3 → graph node 4 → graph node 6 → graph node 7. It will be appreciated that any graph node may be designated as the initial node of the walk.
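The walk just described can be sketched in a few lines; a minimal version under assumptions, where the adjacency map is a hypothetical fragment of the fig. 4 graph and each neighbor is chosen uniformly at random.

```python
import random

def random_walk(adjacency, start, walk_length, seed=0):
    """Perform one random walk over an adjacency map: at each step a
    neighbor of the current node is chosen uniformly at random (node
    3 -> 4 -> 6 -> 7 is one possible outcome on this toy graph)."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = adjacency.get(walk[-1])
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

adjacency = {3: [4], 4: [3, 6], 6: [4, 7], 7: [6]}
print(random_walk(adjacency, start=3, walk_length=4))
```

In DeepWalk, many such walks per node are collected and treated as "sentences" for the subsequent Skip-gram training.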
Define the co-occurrence relationship structure graph as $G = (V, E)$, where $V$ is the set of graph nodes of the co-occurrence relationship structure graph, $E$ is the set of relationship edges, and $E \subseteq V \times V$.

Illustratively, if the ith graph node corresponding to the ith keyword is taken as the root node, the path obtained by a random walk is $(v_0, v_1, \ldots, v_i)$; when embedding the ith word, the probability $P(v_i \mid v_0, v_1, \ldots, v_{i-1})$ of the ith target word needs to be calculated.
During a random walk, the window size needs to be determined; the window size is the unit step length of movement along the target relationship path. Acquire the target relationship path and the window size of the random walk, where the window size corresponds to the number of graph nodes to intercept; along the target relationship path, intercept the m graph nodes before and the m graph nodes after the ith graph node according to the window size to obtain a group of word sequences corresponding to the target relationship path, where m is a positive integer. As shown in fig. 5, with the tth word as the center and a window size of 2, the two words (w(t-2), w(t-1)) before the tth word and the two words (w(t+1), w(t+2)) after it are intercepted.
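The window interception above can be sketched as a simple slice; a minimal illustration in which `window_context` is a hypothetical helper and the m-before / m-after rule is clipped at the sequence boundaries.

```python
def window_context(sequence, center_index, m):
    """Take the m items before and the m items after the center item,
    clipped at the sequence boundaries; the w(t-2)..w(t+2) example in
    the text corresponds to m = 2."""
    left = sequence[max(0, center_index - m):center_index]
    right = sequence[center_index + 1:center_index + 1 + m]
    return left, right

seq = ["w0", "w1", "w2", "w3", "w4"]
print(window_context(seq, 2, 2))  # (['w0', 'w1'], ['w3', 'w4'])
```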
Illustratively, the probability of the ith target word is calculated with a word vector model (Word2vec). Commonly used Word2vec models are the continuous bag-of-words model (CBOW) and the Skip-gram model: CBOW predicts the central word from its context, while Skip-gram predicts the context from the central word. Because the two models are similar, only the structure of the Skip-gram model is shown in fig. 5. The Skip-gram model calculates the above probability as follows:

$$P(w_{i-m+j} \mid w_i) = \frac{\exp\left(u_{i-m+j}^{\top} v_i\right)}{\sum_{k=1}^{|V|} \exp\left(u_k^{\top} v_i\right)}$$

where $w_i$ denotes the ith keyword, $m$ denotes the window size of the random walk, $j$ denotes the position of the context word in the word sequence, $v_i$ denotes the vector of the ith graph node corresponding to the ith keyword, $|V|$ denotes the size of the set formed by all word sequences, $u_k^{\top}$ denotes the transpose of the word vector (matrix) corresponding to the kth word, and $u_{i-m+j}^{\top}$ denotes the transpose of the word vector (matrix) corresponding to the (i-m+j)th word.
Step 306, calling the fully connected layers in the text quality recognition model to concatenate the text vector and the graph vector of the text, and outputting the concatenated first vector.
As shown in fig. 6, the text vector 16 of the text and the graph vector 15 corresponding to the keywords are concatenated 31 (Concat) and input into the text quality recognition model 10. The text quality recognition model includes a plurality of fully connected layers 101 (Dense layers); the text vector 16 and the graph vector 15 are processed by the fully connected layers 101, which output the first vector. The text vector 16 includes at least one of a title text vector and a body text vector, the title text vector being the vector corresponding to the title of the text and the body text vector being the vector corresponding to the body of the text.
Step 307, calling the discarding layer in the text quality recognition model to process the first vector and output the second vector, where the discarding layer is a neural network layer used to filter useless feature vectors out of the first vector.
The text quality recognition model 10 includes a discarding layer (Dropout layer) 102; the discarding layer 102 is invoked to process the first vector and output the second vector. During forward propagation, each neuron's activation stops working with a certain probability p, which makes the model generalize better because it cannot rely too heavily on particular local features. The proportion of neurons that stop working is controlled by the discard rate, which is the ratio of the layer's discarded neuron nodes to its total neuron nodes.
Step 308, calling the logistic regression layer in the text quality recognition model to process the second vector, and outputting the quality grade prediction probability corresponding to the text.
The text quality recognition model 10 includes a logistic regression layer (sigmoid layer) 103; the logistic regression layer 103 is invoked to process the second vector and output the quality grade prediction probability corresponding to the text. The sigmoid layer outputs a value between 0 and 1; in a binary classification task it outputs a prediction probability, and when that probability meets a certain condition, the corresponding input is classified as positive. For example, if a probability threshold of 0.7 is set for the quality grade prediction probability and the logistic regression layer 103 outputs 0.8, the quality grade of the corresponding text is high (a high-quality text).
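The Dense → Dropout → sigmoid pipeline of steps 306-308 can be sketched as a forward pass in pure Python; a minimal illustration with hypothetical toy weights, not the trained model (note that at inference time dropout is a no-op, while during training it zeroes activations and rescales the survivors).

```python
import math
import random

def dense(x, weights, bias):
    # One fully connected layer: y = relu(Wx + b).
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero with probability `rate`, scale survivors.
    if not training:
        return x
    return [0.0 if rng.random() < rate else xi / (1 - rate) for xi in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical concatenated input: text vector + graph vector.
first_input = [0.2, -0.1, 0.7, 0.4]
hidden = dense(first_input,
               [[0.5, 0.1, -0.2, 0.3], [0.2, 0.4, 0.6, -0.1]],
               [0.0, 0.1])                       # "first vector"
second_vector = dropout(hidden, rate=0.5, rng=random.Random(0),
                        training=False)          # "second vector"
probability = sigmoid(sum(w * h for w, h in zip([1.0, -0.5], second_vector)))
print(0.0 < probability < 1.0)  # True: a valid quality-grade probability
```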
The quality grade prediction probability is calculated by the following formula:

$$P = \frac{1}{Z} \sum_{(c_i, v) \in E} \exp\left(c_i^{\top} v\right)$$

where $c_i$ denotes the ith keyword, $v$ denotes a graph node in the co-occurrence relationship structure graph, $Z$ is a normalizing constant, and $E$ is the set of relationship edges in the co-occurrence relationship structure graph.
Step 309, dividing the quality grade of the text according to the quality grade prediction probability.
The quality grade prediction probability serves as a rating label for dividing the quality grade of the text. Illustratively, the quality of the text is divided into two grades, high quality and low quality, based on the quality grade prediction probability. In some embodiments, the quality of the text is divided into high-quality and non-high-quality; in other embodiments, the quality of the text is divided into grades A, B, and C, where grade A indicates the best quality, grade B medium quality, and grade C lower quality. The embodiment of the present application does not limit the specific dividing manner.
Illustratively, a probability threshold is set for the quality level prediction probability, the text with the quality level prediction probability higher than the probability threshold is divided into high-quality text, and the text with the quality level prediction probability lower than the probability threshold is divided into low-quality text.
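The thresholding rule just described amounts to a one-line decision; a minimal sketch in which the 0.7 threshold and the function name `grade_text` are illustrative.

```python
def grade_text(probability, threshold=0.7):
    """Map a quality-grade prediction probability to a label using the
    probability-threshold scheme described above."""
    return "high-quality" if probability >= threshold else "low-quality"

print(grade_text(0.8))  # high-quality
print(grade_text(0.3))  # low-quality
```

A finer grading (e.g. the A/B/C scheme) would simply use two thresholds instead of one.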
In summary, in the method of this embodiment, graph vectors are obtained from the keywords of the text, and these graph vectors represent the association relationships between the keywords. Using these associations, the central idea of the text can be understood in depth and its content accurately judged; combined with the text vector of the text (for example, the multi-dimensional features of the title text vector and the body text vector), the text quality recognition model can accurately recognize the overall quality of the text, so that platforms such as subsequent application programs can recommend high-quality texts to users.
Keywords are extracted from the text, and a co-occurrence relationship structure graph is constructed from the keywords and the co-occurrence relationships between them, converting the text into a graph that represents its content; the DeepWalk model then generates the word vectors of the text from this graph.
And constructing a co-occurrence relation structure diagram by combining the relation edges corresponding to the co-occurrence relation among the keywords and the nodes corresponding to the keywords, and more intuitively determining the relation among the keywords through the co-occurrence relation structure diagram.
By using any graph node as a root node to carry out random walk processing, graph nodes in the co-occurrence relation structure chart can be traversed to form a keyword sequence corresponding to the graph nodes, and keywords can be accurately mapped into word vectors through subsequent word embedding processing.
And intercepting the graph nodes by using the window size to form a word sequence, thereby ensuring that the word sequence extracted from the co-occurrence relation structure graph is correct, and ensuring that the word sequence is also correct when being mapped into a word vector subsequently.
The text quality recognition model can accurately recognize the text quality by setting a full connection layer, a discarding layer and a logistic regression layer in the text quality recognition model.
In an alternative embodiment based on fig. 3, the DeepWalk model is trained as follows:
Step 320, obtaining a sample text, where the sample text corresponds to a real sample graph vector.
Illustratively, the sample text is stored in a corpus of the server, or the sample text is a text stored in the terminal, and the sample text is transmitted by the terminal to the server.
Illustratively, the step 320 may be replaced by the following steps by selecting positive sample text and negative sample text from the sample text in combination with the text evaluation data:
step 3201, text evaluation data is obtained, and the text evaluation data includes at least one of reading amount, review amount, forwarding amount, and approval amount of the text.
The server stores sample texts and corresponding text evaluation data. Illustratively, the text evaluation data of a sample text includes the reading amount and the comment amount of the text; for example, one article published on an official account has a comment amount of 10,350 and a reading amount of 100,000, while an article published on another official account has a comment amount of 2,300 and a reading amount of 30,000.
Step 3202, a positive sample text and a negative sample text are selected from the sample texts according to the text evaluation data.
Illustratively, corresponding preset conditions are set for the text evaluation data: for example, articles whose comment amount exceeds 10,000, or whose reading amount exceeds 50,000, are high-quality articles, and articles whose comment amount or reading amount does not meet the preset condition are non-high-quality articles. As another example, an article with a comment amount above 10,000 but a reading amount below 50,000 is a medium-quality article. The embodiment of the present application does not limit how the evaluation criterion is set. A positive sample text is a sample text with a good quality grade, and a negative sample text is a sample text with a poor quality grade.
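The sample-selection rule of steps 3201-3202 can be sketched as follows; a minimal illustration in which the thresholds, the dict keys, and the function name `split_samples` are all hypothetical stand-ins for the preset conditions described above.

```python
def split_samples(sample_texts, comment_threshold=10_000, read_threshold=50_000):
    """Split sample texts into positive / negative samples by the
    illustrative rule above: a text is a positive sample if its
    comment count exceeds 10,000 or its read count exceeds 50,000."""
    positive, negative = [], []
    for text in sample_texts:
        if text["comments"] > comment_threshold or text["reads"] > read_threshold:
            positive.append(text)
        else:
            negative.append(text)
    return positive, negative

samples = [
    {"id": 1, "comments": 10_350, "reads": 100_000},  # the example article
    {"id": 2, "comments": 2_300, "reads": 30_000},
]
pos, neg = split_samples(samples)
print([t["id"] for t in pos], [t["id"] for t in neg])  # [1] [2]
```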
The sample text used for training corresponds to a real sample graph vector. The real sample graph vector is generated by another, already trained DeepWalk model.
Step 321, extracting sample keywords from the sample text, where the sample keywords correspond to sample graph nodes in the sample co-occurrence relationship structure diagram one to one.
The keywords of the sample text may be extracted as in the above embodiment, or extracted by manual reading, which is not limited in the embodiment of the present application. The sample keywords in the sample text are used as the sample graph nodes in the sample co-occurrence relationship structure graph.
And 322, constructing a sample co-occurrence relationship structure diagram according to the sample diagram nodes, wherein the sample co-occurrence relationship structure diagram is used for representing co-occurrence relationships among sample keywords in the sample text.
Relationship edges in the sample co-occurrence relationship structure graph are generated according to the co-occurrence relationships among the sample keywords in the sample text, and the sample co-occurrence relationship structure graph is constructed from those relationship edges and the sample graph nodes determined from the sample keywords in step 321.
Step 323, inputting the sample co-occurrence relationship structure graph into the DeepWalk model, and outputting the predicted sample graph vector corresponding to the sample keywords.
The sample co-occurrence relationship structure graph is concatenated (Concat) with the title text vector extracted from the title of the sample text and the body text vector extracted from the body of the sample text; the concatenated vector is input into the DeepWalk model, and the predicted sample graph vector corresponding to the sample keywords is output through the fully connected layers, discarding layer, and logistic regression layer.
Step 324, training the DeepWalk model according to the predicted sample graph vector and the real sample graph vector to obtain the trained DeepWalk model.
An error function is used to calculate the error between the predicted sample graph vector and the real sample graph vector, and the DeepWalk model is trained with a back propagation algorithm according to the error result.
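The error-driven update of step 324 can be sketched with a mean-squared-error loss and a plain gradient step; this is a stand-in under assumptions for full backpropagation through the model, with a hypothetical learning rate and toy vectors.

```python
def mse(pred, real):
    """Mean squared error between a predicted and a real graph vector."""
    return sum((p - r) ** 2 for p, r in zip(pred, real)) / len(pred)

def gradient_step(pred, real, lr=0.1):
    """One illustrative update nudging the predicted graph vector toward
    the real one: p <- p - lr * dMSE/dp."""
    return [p - lr * 2 * (p - r) / len(pred) for p, r in zip(pred, real)]

pred, real = [0.0, 1.0], [1.0, 0.0]
initial_error = mse(pred, real)
for _ in range(50):
    pred = gradient_step(pred, real)
print(mse(pred, real) < initial_error)  # True: the error decreases
```

In the real training loop the gradient would flow back through the model's parameters rather than directly into the predicted vector.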
In summary, in the method of the embodiment, the positive sample and the negative sample which are more suitable for training are selected from the sample text through the text evaluation data, so that the text quality recognition model can be comprehensively trained, and the trained text quality recognition model can accurately recognize the quality of the text.
In one example, a user reads an article in the sharing application, a terminal used by the user sends a reading request to a server, and the server recommends a high-quality article to the user according to the method provided in the above embodiment.
The quality of an article is generally judged from two aspects: objective prior experience of the article (including at least one of the article's typesetting, the clarity and aesthetic quality of its illustrations, and how well the illustrations match the article content) and the quality of the article's text content. Text content quality can be divided into two measurement dimensions: basic content quality and content attractiveness. The embodiment of the application is explained with respect to the content attractiveness dimension.
First, the server extracts the keywords of all articles from the corpus to construct graph nodes, builds a co-occurrence relationship graph from the co-occurrence relationships of keywords within the same article, samples the co-occurrence relationship structure graph by Random Walk, and trains the Graph Embedding of the graph nodes with a Skip-gram model. After the Graph Embedding based on article keywords and the FastText pre-trained word vectors based on article titles and bodies have been trained, positive and negative attractiveness samples are screened according to the users' posterior consumption data (text evaluation data) for the text content. The text vectors generated by FastText pre-training on the title and body are concatenated (Concat) with the graph vectors generated from the article keywords as model input, and several Dense layers are connected for classification, completing the recognition scheme for high-quality, attractive image-text content. As shown in fig. 6, the quality grade prediction probability (quality grade label) of the article is output at the logistic regression layer and input into the high-quality attractiveness model 32, so that whether the article is attractive to the user can subsequently be judged. In some embodiments, the quality grade prediction probability of the article can also be input into other models, so that information of interest to the user can subsequently be recommended accurately.
Through tests, compared with the method used in the prior art, the method provided by the embodiment of the application has the advantages that the text quality identification accuracy rate reaches 95.86%, the image-text high-quality content coverage rate reaches 16.8%, the overall high-quality exposure proportion is improved by 0.5%, the total efficiency is improved by 0.4%, the total duration is improved by 0.32%, and the push efficiency is improved by 0.28%.
The method provided by the embodiment of the application can be applied to a content processing link of a content center, a server scores content quality of all image-text content and distributes the image-text content to an end side, the end side performs hierarchical recommendation weighting according to the content quality scores, for example, recommendation weighting is performed on identified high-quality content, recommendation weight reduction is performed on low-quality content, and the like.
Fig. 7 is a block diagram illustrating a structure of a text quality recognition apparatus according to an exemplary embodiment of the present application, where the apparatus includes:
the obtaining module 710 is configured to obtain a text vector of a text, where the text vector of the text at least includes one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text;
the obtaining module 710 is configured to obtain the graph vector corresponding to the keywords in the text, where the graph vector is obtained after the keywords are subjected to graph embedding processing;

the classification module 720 is configured to classify the text vector and the graph vector of the text to obtain the quality grade prediction probability corresponding to the text;
and the quality classification module 730 is configured to divide the quality grade of the text according to the quality grade prediction probability.
In an alternative embodiment, the apparatus includes a processing module 740;
the obtaining module 710 is configured to extract keywords from a text; determining the keywords as graph nodes;
the processing module 740 is configured to construct a co-occurrence relationship structure diagram according to the graph nodes, where the co-occurrence relationship structure diagram is used to represent co-occurrence relationships among the keywords in the text; and calling a depth migration model to perform graph embedding processing on the co-occurrence relation structure graph, and outputting a graph vector corresponding to the keyword.
In an optional embodiment, the processing module 740 is configured to determine an ith graph node in the co-occurrence relationship structure diagram as a root node, where the ith graph node corresponds to an ith keyword in the text, and i is a positive integer; the obtaining module 710 is configured to obtain n relationship paths corresponding to an origin point of the root node, where the relationship paths correspond to relationship edges in the co-occurrence relationship structure diagram one to one, and n is a positive integer; the processing module 740 is configured to select a target relationship path from the n relationship paths to perform random walk processing, so as to obtain a group of word sequences corresponding to the target relationship path; and repeatedly executing the three steps until all graph nodes in the co-occurrence relation structure chart are traversed, carrying out graph embedding processing on the word sequence corresponding to the relation path, and outputting a graph vector corresponding to the keyword.
In an optional embodiment, the processing module 740 is configured to obtain a target relationship path and a window size in the random walk process, where the window size corresponds to the number of intercepted graph nodes; and intercepting the front m graph nodes and the back m graph nodes connected with the ith graph node according to the window size along the target relation path to obtain a group of word sequences corresponding to the target relation path, wherein m is a positive integer.
In an optional embodiment, the processing module 740 is configured to generate a relationship edge in the co-occurrence relationship structure diagram according to the co-occurrence relationship of the keyword in the text; and constructing a co-occurrence relationship structure diagram according to the graph nodes and the relationship edges.
In an optional embodiment, the processing module 740 is configured to invoke a full connection layer in the text quality recognition model to perform a splicing process on a text vector and a graph vector of a text, and output a first vector after the splicing; calling a discarding layer in the text quality recognition model to process the first vector and output a second vector, wherein the discarding layer is a neural network layer used for filtering useless feature vectors from the first vector; and calling a logistic regression layer in the text quality recognition model to process the second vector, and outputting the quality grade prediction probability corresponding to the text.
In an alternative embodiment, the apparatus includes a training module 750;
the training module 750 is configured to obtain a sample text, where the sample text corresponds to a real sample diagram vector; extracting sample keywords from the sample text, wherein the sample keywords correspond to sample graph nodes in the sample co-occurrence relation structure chart one by one; constructing a sample co-occurrence relation structure diagram according to the sample diagram nodes, wherein the sample co-occurrence relation structure diagram is used for representing co-occurrence relations among sample keywords in a sample text; inputting the sample co-occurrence relation structure chart into a depth migration model, and outputting a predicted sample chart vector corresponding to a sample keyword; and training the depth migration model according to the predicted sample image vector and the real sample image vector to obtain the trained depth migration model.
In an optional embodiment, the training module 750 is configured to obtain text evaluation data, where the text evaluation data includes at least one of a reading amount, a comment amount, a forwarding amount, and a like amount of a text; and selecting a positive sample text and a negative sample text from the sample texts according to the text evaluation data.
In summary, in the apparatus provided in this embodiment, graph vectors are obtained from the keywords of the text, and these graph vectors represent the association relationships between the keywords. Using these associations, the central idea of the text can be understood in depth and its content accurately judged; combined with the text vector of the text (for example, the multi-dimensional features of the title text vector and the body text vector), the text quality recognition model can accurately recognize the overall quality of the text, so that platforms such as subsequent application programs can recommend high-quality texts to users.
Keywords are extracted from the text, and a co-occurrence relationship structure graph is constructed from the keywords and the co-occurrence relationships between them, converting the text into a graph that represents its content; the DeepWalk model then generates the word vectors of the text from this graph.
And constructing a co-occurrence relation structure diagram by combining the relation edges corresponding to the co-occurrence relation among the keywords and the nodes corresponding to the keywords, and more intuitively determining the relation among the keywords through the co-occurrence relation structure diagram.
By using any graph node as a root node to carry out random walk processing, graph nodes in the co-occurrence relation structure chart can be traversed to form a keyword sequence corresponding to the graph nodes, and keywords can be accurately mapped into word vectors through subsequent word embedding processing.
And intercepting the graph nodes by using the window size to form a word sequence, thereby ensuring that the word sequence extracted from the co-occurrence relation structure graph is correct, and ensuring that the word sequence is also correct when being mapped into a word vector subsequently.
The text quality recognition model can accurately recognize the text quality by setting a full connection layer, a discarding layer and a logistic regression layer in the text quality recognition model.
The positive sample and the negative sample which are more suitable for training are selected from the sample text through the text evaluation data, so that the text quality recognition model can be comprehensively trained, and the trained text quality recognition model can accurately recognize the quality of the text.
It should be noted that: the text quality recognition device provided in the above embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the text quality recognition device provided by the above embodiment and the text quality recognition method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 8 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. The server may be the server 120 in the computer system 100 shown in fig. 1.
The server 800 includes a Central Processing Unit (CPU) 801, a system Memory 804 including a Random Access Memory (RAM) 802 and a Read Only Memory (ROM) 803, and a system bus 805 connecting the system Memory 804 and the Central Processing Unit 801. The server 800 also includes a basic Input/Output System (I/O) 806 to facilitate transfer of information between devices within the computer, and a mass storage device 807 for storing an operating System 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or a keyboard, for a user to input information. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may further include the input/output controller 810 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Computer-readable media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, Solid State Drives (SSD), magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. The random access memory may include a resistive random access memory (ReRAM) and a dynamic random access memory (DRAM). Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also be run by a remote computer connected to a network through a network such as the Internet. That is, the server 800 may be connected to the network 812 through a network interface unit 811 connected to the system bus 805, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 811.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
In an alternative embodiment, a computer device is provided, including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the text quality recognition method described above.
In an alternative embodiment, a computer-readable storage medium is provided, storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the text quality recognition method described above.
Optionally, the computer-readable storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), an optical disc, or the like. The random access memory may include a resistive random access memory (ReRAM) and a dynamic random access memory (DRAM). The serial numbers of the foregoing embodiments of the present application are merely for description and do not represent the relative merits of the embodiments.
Embodiments of the present application also provide a computer program product or a computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the text quality recognition method described above.
It will be understood by those skilled in the art that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above descriptions are merely exemplary embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (11)

1. A method for recognizing text quality, the method comprising:
acquiring a text vector of a text, wherein the text vector of the text comprises at least one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text;
obtaining a graph vector corresponding to a keyword in the text, wherein the graph vector is obtained by performing graph embedding processing on the keyword;
classifying the text vector and the graph vector of the text to obtain a quality grade prediction probability corresponding to the text;
and determining a quality grade of the text according to the quality grade prediction probability.
2. The method according to claim 1, wherein the obtaining a graph vector corresponding to the keyword in the text comprises:
extracting the keyword from the text;
determining the keyword as a graph node;
constructing a co-occurrence relationship graph according to the graph nodes, wherein the co-occurrence relationship graph is used for representing co-occurrence relationships among keywords in the text;
and invoking a DeepWalk model to perform the graph embedding processing on the co-occurrence relationship graph, and outputting the graph vector corresponding to the keyword.
3. The method according to claim 2, wherein the invoking the DeepWalk model to perform the graph embedding processing on the co-occurrence relationship graph and outputting the graph vector corresponding to the keyword comprises:
determining an ith graph node in the co-occurrence relationship graph as a root node, wherein the ith graph node corresponds to an ith keyword in the text, and i is a positive integer;
acquiring n relationship paths that take the root node as an origin, wherein the relationship paths are in one-to-one correspondence with relationship edges in the co-occurrence relationship graph, and n is a positive integer;
selecting a target relationship path from the n relationship paths and performing random walk processing to obtain a group of word sequences corresponding to the target relationship path;
and repeating the foregoing three steps until all the graph nodes in the co-occurrence relationship graph have been traversed, performing the graph embedding processing on the word sequences corresponding to the relationship paths, and outputting the graph vector corresponding to the keyword.
4. The method according to claim 3, wherein the selecting a target relationship path from the n relationship paths and performing random walk processing to obtain a group of word sequences corresponding to the target relationship path comprises:
acquiring the target relationship path and a window size in the random walk process, wherein the window size corresponds to the number of graph nodes to be intercepted;
and intercepting, along the target relationship path according to the window size, the m graph nodes preceding and the m graph nodes following the ith graph node, to obtain the group of word sequences corresponding to the target relationship path, wherein m is a positive integer.
5. The method according to claim 2, wherein the constructing a co-occurrence relationship graph according to the graph nodes comprises:
generating relationship edges in the co-occurrence relationship graph according to the co-occurrence relationships of the keywords in the text;
and constructing the co-occurrence relationship graph according to the graph nodes and the relationship edges.
6. The method according to any one of claims 1 to 5, wherein the classifying the text vector and the graph vector of the text to obtain the quality grade prediction probability corresponding to the text comprises:
invoking a fully connected layer in a text quality recognition model to splice the text vector and the graph vector of the text, and outputting a spliced first vector;
invoking a dropout layer in the text quality recognition model to process the first vector and output a second vector, wherein the dropout layer is a neural network layer used for filtering useless feature vectors out of the first vector;
and invoking a logistic regression layer in the text quality recognition model to process the second vector, and outputting the quality grade prediction probability corresponding to the text.
7. The method according to claim 2, wherein the DeepWalk model is trained in the following manner:
obtaining a sample text, wherein the sample text corresponds to a real sample graph vector;
extracting sample keywords from the sample text, wherein the sample keywords are in one-to-one correspondence with sample graph nodes in a sample co-occurrence relationship graph;
constructing the sample co-occurrence relationship graph according to the sample graph nodes, wherein the sample co-occurrence relationship graph is used for representing co-occurrence relationships among the sample keywords in the sample text;
inputting the sample co-occurrence relationship graph into the DeepWalk model, and outputting a predicted sample graph vector corresponding to the sample keywords;
and training the DeepWalk model according to the predicted sample graph vector and the real sample graph vector to obtain a trained DeepWalk model.
8. The method according to claim 7, wherein the obtaining a sample text comprises:
acquiring text evaluation data, wherein the text evaluation data comprises at least one of a read count, a comment count, a forwarding count, and a like count of the text;
and selecting a positive sample text and a negative sample text from sample texts according to the text evaluation data.
9. An apparatus for recognizing text quality, the apparatus comprising:
an acquisition module, configured to acquire a text vector of a text, wherein the text vector of the text comprises at least one of a title text vector and a body text vector, the title text vector is a vector corresponding to a title of the text, and the body text vector is a vector corresponding to a body of the text;
the acquisition module being further configured to obtain a graph vector corresponding to a keyword in the text, wherein the graph vector is obtained by performing graph embedding processing on the keyword;
a classification module, configured to classify the text vector and the graph vector of the text to obtain a quality grade prediction probability corresponding to the text;
and a quality classification module, configured to determine a quality grade of the text according to the quality grade prediction probability.
10. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the text quality recognition method according to any one of claims 1 to 8.
11. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the text quality recognition method according to any one of claims 1 to 8.
CN202011003717.3A 2020-09-22 2020-09-22 Text quality recognition method, device, equipment and storage medium Active CN112131345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003717.3A CN112131345B (en) 2020-09-22 2020-09-22 Text quality recognition method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112131345A true CN112131345A (en) 2020-12-25
CN112131345B CN112131345B (en) 2024-02-06

Family

ID=73842428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011003717.3A Active CN112131345B (en) 2020-09-22 2020-09-22 Text quality recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112131345B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650837A (en) * 2020-12-28 2021-04-13 上海风秩科技有限公司 Text quality control method and system combining classification algorithm and unsupervised algorithm
CN112800227A (en) * 2021-01-29 2021-05-14 科大讯飞股份有限公司 Training method of text classification model, equipment and storage medium thereof
CN113763014A (en) * 2021-01-05 2021-12-07 北京沃东天骏信息技术有限公司 Article co-occurrence relation determining method and device and judgment model obtaining method and device

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004005025A (en) * 2002-05-30 2004-01-08 Mazda Motor Corp Text sorter, method of sorting text, and its computer program
CN103049440A (en) * 2011-10-11 2013-04-17 腾讯科技(深圳)有限公司 Recommendation processing method and processing system for related articles
US20150064684A1 (en) * 2013-08-29 2015-03-05 Fujitsu Limited Assessment of curated content
US20170249311A1 (en) * 2016-02-26 2017-08-31 Yahoo! Inc. Quality-based scoring and inhibiting of user-generated content
CN107506360A (en) * 2016-06-14 2017-12-22 科大讯飞股份有限公司 A kind of essay grade method and system
CN109255025A (en) * 2018-08-01 2019-01-22 华中科技大学鄂州工业技术研究院 A kind of short text classification method
US20190065589A1 (en) * 2016-03-25 2019-02-28 Quad Analytix Llc Systems and methods for multi-modal automated categorization
CN110188350A (en) * 2019-05-22 2019-08-30 北京百度网讯科技有限公司 Text coherence calculation method and device
CN110795657A (en) * 2019-09-25 2020-02-14 腾讯科技(深圳)有限公司 Article pushing and model training method and device, storage medium and computer equipment
CN111104789A (en) * 2019-11-22 2020-05-05 华中师范大学 Text scoring method, device and system
CN111310436A (en) * 2020-02-11 2020-06-19 腾讯科技(深圳)有限公司 Text processing method and device based on artificial intelligence and electronic equipment
WO2020133960A1 (en) * 2018-12-25 2020-07-02 平安科技(深圳)有限公司 Text quality inspection method, electronic apparatus, computer device and storage medium
CN111368075A (en) * 2020-02-27 2020-07-03 腾讯科技(深圳)有限公司 Article quality prediction method and device, electronic equipment and storage medium
US20200242304A1 (en) * 2017-11-29 2020-07-30 Tencent Technology (Shenzhen) Company Limited Text recommendation method and apparatus, and electronic device


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650837A (en) * 2020-12-28 2021-04-13 上海风秩科技有限公司 Text quality control method and system combining classification algorithm and unsupervised algorithm
CN112650837B (en) * 2020-12-28 2023-12-12 上海秒针网络科技有限公司 Text quality control method and system combining classification algorithm and unsupervised algorithm
CN113763014A (en) * 2021-01-05 2021-12-07 北京沃东天骏信息技术有限公司 Article co-occurrence relation determining method and device and judgment model obtaining method and device
CN112800227A (en) * 2021-01-29 2021-05-14 科大讯飞股份有限公司 Training method of text classification model, equipment and storage medium thereof
CN112800227B (en) * 2021-01-29 2023-01-17 科大讯飞股份有限公司 Training method of text classification model, equipment and storage medium thereof



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant