CN114254645A - Artificial intelligence auxiliary writing system - Google Patents


Publication number
CN114254645A
CN114254645A
Authority
CN
China
Prior art keywords
sentence
matrix
vector
module
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011002905.4A
Other languages
Chinese (zh)
Inventor
艾浒
张楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bailing Internet Technology Co ltd
Original Assignee
Beijing Bailing Internet Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bailing Internet Technology Co ltd
Priority to CN202011002905.4A
Publication of CN114254645A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The invention discloses an artificial intelligence auxiliary writing system. The writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module; the word vector semantic module comprises a CBOW model neural network training module, the information processing module comprises an information collecting module, a text box input module and a text box output module, the sentence vector semantic module comprises a sentence vector combination algorithm, and the sentence vector matrix module comprises a semantic matrix association algorithm. By creating a new sentence-meaning algorithm, the invention converts a passage of text or a sentence into data that a computer can store and compute. Compared with traditional word-meaning calculation this is closer to the ideal: by computing similarity between sentence meanings, the system can output texts similar to the user's input, assisting text writing and giving the user self-checking and comparison during writing.

Description

Artificial intelligence auxiliary writing system
Technical Field
The invention relates to the field of machine learning, in particular to an artificial intelligence auxiliary writing system.
Background
To model complex natural language tasks, probability model techniques were used at first, but learning the joint probability function of a language model suffers from a fatal curse of dimensionality. If the lexicon size of the language model is 100,000 and one-hot encoding represents the joint distribution of 10 consecutive words, the total number of parameters of the deep model can reach 100,000^10 = 10^50. Accordingly, the number of samples required for a model with sufficient confidence increases exponentially. To solve this problem, Hinton et al. first proposed Distributed Representation in 1986; the basic idea is to represent words as n-dimensional continuous real vectors. The distributed representation has strong feature-representation capability: an n-dimensional vector with k possible values in each dimension can represent k^n features. Common open-source, pre-trained word vector models typically have n in the hundreds or even thousands of dimensions. A common word vector training method is CBOW (Continuous Bag-of-Words Model).
Word vectors are the basis of NLP deep learning research, since semantically similar words tend to appear in similar contexts. During learning, these vectors strive to capture the neighboring features of words, and thus learn similarities between words. Compared with raw characters, word vectors have the advantage of being computable, so the similarity between words can be measured by cosine distance, Euclidean distance and the like. But this method has no capability at the level of sentence semantic similarity or article similarity.
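As an illustration of the computability described above, the following is a minimal sketch of measuring word similarity by cosine distance; the three-dimensional vectors are toy placeholders, not trained embeddings:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy word vectors (illustrative values only).
vec_king = [0.8, 0.6, 0.1]
vec_queen = [0.7, 0.7, 0.2]
vec_apple = [0.1, 0.2, 0.9]

# Semantically close words should score higher than unrelated ones.
close = cosine_similarity(vec_king, vec_queen)
far = cosine_similarity(vec_king, vec_apple)
```

With real embeddings, `close` would exceed `far` for genuinely related word pairs; here the toy values are chosen so the same ordering holds.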
In addition, in natural language understanding, existing NLP technology does not understand and memorize literary works and other texts, cannot perform associative calculation on text input by the user, does not return semantically close high-quality texts, and thus cannot help the user associate ideas, index classics, and write. For example, Baidu's ERNIE 2.0, the latest natural language understanding technology in 2020, comprehensively and significantly surpasses the prior state of the art on 16 public data sets covering sentiment analysis, text matching, natural language inference, lexical analysis, reading comprehension, intelligent question answering and more, but it does not perform associative calculation on user-input text, cannot return semantically close high-quality texts, and its training cost is high.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an artificial intelligence auxiliary writing system.
In order to solve the technical problems, the invention provides the following technical scheme:
the invention discloses an artificial intelligence auxiliary writing system, which comprises a writing system, wherein the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module, the word vector semantic module comprises a CBOW model neural network training module, the information processing module comprises an information collecting module, a text box input module and a text box output module, the sentence vector semantic module comprises a sentence vector combination algorithm, and the sentence vector matrix module comprises a semantic matrix association algorithm, and the writing system specifically comprises the following steps:
A. collecting a large number of literary works through an information processing module, and converting characters into character strings after segmentation to form a character paragraph library;
B. b, processing the character paragraphs acquired in the step A through a word vector semantic module, firstly segmenting the character paragraphs, then sequentially processing the words through a CBOW model neural network training module to obtain word vectors of all the words, and then combining all the word vectors to form a phrase vector;
C. b, the phrase vector library in the step B is integrally arranged in a sentence vector semantic module, and a word vector is output as a sentence vector through a sentence vector combination algorithm, so that sentences of the text paragraphs are mainly expressed through the sentence vector;
D. b, after each paragraph in the text paragraph library generated in the step A passes through the step B, C, obtaining a sentence characteristic vector of each text paragraph, expressing the characteristic sentence vector of the sentence by adopting a floating point type, and combining all sentence characteristic vectors to form a literary work matrix library;
E. a user inputs a target text through a text box input module of the information processing module, and after the text is converted into a character string, a target sentence vector is formed through the step B and the step C;
F. and D, processing the target sentence vectors and the literary work matrix library in the step D through a semantic matrix association algorithm of the sentence vector matrix module to obtain a similar sentence vector set, outputting the similar sentence vector set to a text box output module of the information processing module, and arranging the similar sentence vectors in an ascending order according to a similarity rate.
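The steps above can be compressed into a small sketch, assuming toy sentence vectors in place of the CBOW pipeline; the corpus, vectors, and function names below are illustrative, not from the patent:

```python
# Step D stand-in: a tiny "literary work matrix library" of sentence
# vectors; in the described system these come from steps A-C.
toy_library = {
    "the moon is bright": [0.1, 0.9],
    "rain falls at dusk": [0.8, 0.2],
    "moonlight on snow": [0.2, 0.8],
}

def sentence_vector(text):
    # Stand-in for steps B-C (segmentation + CBOW + vector combination).
    return toy_library[text]

def assist_write(user_text):
    # Steps E-F: vectorize the input, compute squared distance to each
    # stored sentence, and return sentences in ascending distance order
    # (most similar first).
    target = sentence_vector(user_text)
    dist = {s: sum((t - g) ** 2 for t, g in zip(target, v))
            for s, v in toy_library.items() if s != user_text}
    return sorted(dist, key=dist.get)
```

Calling `assist_write("the moon is bright")` returns the other stored sentences ordered by vector closeness, mirroring the text-box output of step F.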
As a preferred technical scheme of the invention, the information processing module comprises a web crawler or an external web API interface and is mainly used for extracting literary work information.
As a preferred technical solution of the present invention, the CBOW model neural network training module is used under the word2vec bag-of-words algorithm model. Its training process extracts literature sentences from a large corpus as training data; from each sentence a word w(t) is extracted and predicted from its context words w(t-2), w(t-1), w(t+1), w(t+2). The trained CBOW model neural network training module can quantize word strings; training comprises the following steps:
(1) inputting the one-hot encoding of each context word of the current word into the input layer, where the dimension of the one-hot encoding is 1 × V; a matrix W1 of dimension V × N is set, where V is the total number of words contained in the dictionary and N is a user-defined dimension;
(2) multiplying each context word by the same matrix W1 to obtain a 1 × N vector per context word, averaging these into a single 1 × N vector, and finally multiplying the averaged 1 × N vector by a matrix W2 of dimension N × V to obtain a 1 × V vector;
(3) normalizing the 1 × V vector to obtain the probability of each word, taking the word with the maximum probability value as the predicted word w'(t), computing the error between the predicted word w'(t) and the true expected word w(t), and performing back-propagation gradient descent to adjust the values of W1 and W2; the final W1 matrix is the word vector library of the literature sentences.
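Steps (1) to (3) can be sketched as a single CBOW forward pass; the vocabulary size V, dimension N, random weights, and context indices below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                     # toy vocabulary size and embedding dimension
W1 = rng.normal(size=(V, N))     # V x N: rows become the word vector library
W2 = rng.normal(size=(N, V))     # N x V: hidden-to-output weights

context_ids = [1, 2, 4, 5]       # one-hot indices of w(t-2), w(t-1), w(t+1), w(t+2)

# Step (2): look up each context word's 1 x N vector and average them.
h = W1[context_ids].mean(axis=0)

# Step (3): project to 1 x V scores and normalize (softmax) into probabilities.
scores = h @ W2
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))   # index of the predicted word w'(t)
```

Training would compare this prediction against the true w(t) and back-propagate into W1 and W2; after training, the rows of W1 serve as the word vector library.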
As a preferred technical solution of the present invention, the sentence vector combination algorithm is computed on top of the CBOW model neural network training module: a sentence vector is formed from the word vectors that module produces. Specifically: suppose the target sentence A contains n words, each represented by an m-dimensional word vector in the word vector library; the set of word vectors of sentence A is X = (X_1, X_2, …, X_n), where each word vector can be written as:
X_1 = [X_11, X_12, …, X_1m]
X_2 = [X_21, X_22, …, X_2m]
……
X_n = [X_n1, X_n2, …, X_nm]
If the semantic feature vector of sentence A is Avec, the algorithm of Avec is:
Avec = [(X_11 + X_21 + … + X_n1)/n, (X_12 + X_22 + … + X_n2)/n, …, (X_1m + X_2m + … + X_nm)/n]
For simplicity of representation, let:
Y_1 = (X_11 + X_21 + … + X_n1)/n
Y_2 = (X_12 + X_22 + … + X_n2)/n
……
Y_m = (X_1m + X_2m + … + X_nm)/n
Then the semantic feature vector of sentence A is Avec = [Y_1, Y_2, …, Y_m], and the sentence vector Avec is obtained, where each Y is a floating-point number. After many sentence vectors are collected, let the total number of sentences be S; the floating-point matrix formed from the sentence vectors is then expressed as:
        | Y_11  Y_12  …  Y_1m |
    G = | Y_21  Y_22  …  Y_2m |   (S rows × m columns)
        |  …     …         …  |
        | Y_S1  Y_S2  …  Y_Sm |
The output matrices are combined to form the literary work matrix library G.
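The sentence vector combination above is a per-dimension average of the word vectors, stacked into the matrix library; a minimal sketch, where the three-dimensional word vectors are illustrative assumptions:

```python
import numpy as np

word_vectors = {                      # toy m = 3 dimensional word vector library
    "moon": [0.1, 0.8, 0.3],
    "bright": [0.2, 0.7, 0.1],
    "night": [0.0, 0.9, 0.4],
}

def sentence_vector(words):
    # Avec: each component Y_j is the mean of the j-th components
    # of the n word vectors in the sentence.
    X = np.array([word_vectors[w] for w in words], dtype=np.float64)
    return X.mean(axis=0)

# Stack S sentence vectors into the S x m literary work matrix library G.
sentences = [["moon", "bright"], ["night", "moon"]]
G = np.vstack([sentence_vector(s) for s in sentences])
```

Each row of `G` is one sentence's floating-point feature vector, matching the matrix layout above.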
As a preferred technical scheme of the invention, the semantic association algorithm mainly computes over the target text and the literary work matrix library together, uses the Euclidean distance formula, and comprises the following steps:
Let the target text input by the user in step E be the X text; through steps B and C its n-dimensional feature vector X = (X_1, X_2, …, X_n) is obtained (this X denotes the sentence vector of the target text and is distinct from the word vector set X above). Let a comparison sentence vector be Y = (Y_1, Y_2, …, Y_n). The multi-dimensional distance formula is then:
d(X, Y) = √[(X_1 − Y_1)² + (X_2 − Y_2)² + … + (X_n − Y_n)²]
The distances between the X text and the many stored sentence feature vectors are computed by comparing the X text in turn against the millions of sentence feature vectors stored by the program; these distances measure the similarity between sentences, and the similar sentences are finally sorted.
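A minimal sketch of this loop: compute the Euclidean distance from the target vector to each stored sentence vector and sort ascending; the vectors are toy values:

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

G = np.array([[0.0, 0.0],      # toy stored sentence vectors (the matrix library)
              [1.0, 1.0],
              [0.1, 0.0]])
x = np.array([0.0, 0.1])       # target sentence vector of the X text

# Ascending distance = most similar stored sentence first.
ranked = sorted(range(len(G)), key=lambda i: euclidean(x, G[i]))
```

`ranked` holds row indices of `G` from nearest to farthest, which is the ordering the output module would present.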
As a preferred technical solution of the present invention, the semantic association algorithm includes an algorithm simplification process, and includes the following steps:
First define a transformation matrix C of m rows and one column, with every element equal to 1:
C = [1, 1, …, 1]ᵀ   (m rows × 1 column)
Multiplying the m-row transformation matrix C by the sentence vector of the X text (a 1 × n row) stacks m copies of it:
X' = C·X = | X_1  X_2  …  X_n |
           | X_1  X_2  …  X_n |
           |  …    …       …  |
           | X_1  X_2  …  X_n |   (m rows × n columns)
All memorized sentence feature vectors are combined into the matrix G, which has m rows and n columns: the algorithm has memorized m sentences, and each sentence feature vector has n dimensions;
D = X' − G, where X' is the user's X text, originally a one-row, n-column matrix, converted by the transformation matrix C into the m-row, n-column matrix X'; subtracting G from X' yields the matrix D of differences between the sentence vector of the X text and every sentence vector in the literary work matrix library;
E=D⊙D,
where the operator ⊙ denotes the Hadamard product, a matrix operation: if A = (a_ij) and B = (b_ij) are two matrices of the same order and c_ij = a_ij × b_ij, then the matrix C = (c_ij) is called the Hadamard product (elementwise product) of A and B. In this formula, E is therefore the Hadamard product of D with itself, i.e. every element of D is squared;
Finally F = E·1_n, where 1_n is an n-row, one-column matrix of ones that sums each row of E. The result F is an m-row, one-column matrix whose values measure the similarity between the X text and each sentence; after ascending sorting, the list of sentences most similar to the X text is obtained. The square root of the original Euclidean formula is omitted, since it does not change the ordering, so the final formula for the associative distance is:
F_i = (X_1 − G_i1)² + (X_2 − G_i2)² + … + (X_n − G_in)²,   i = 1, …, m
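The simplification above maps directly onto array operations; a sketch in which explicit all-ones matrices play the roles of the transformation matrix and the row sum (sizes are toy values):

```python
import numpy as np

m, n = 4, 3
rng = np.random.default_rng(1)
G = rng.normal(size=(m, n))        # m stored sentence vectors, n dimensions each
x = rng.normal(size=(1, n))        # the user's X text as a 1 x n row

Xp = np.ones((m, 1)) @ x           # X' = C.X: m stacked copies of x
D = Xp - G                         # difference matrix
E = D * D                          # Hadamard product D (.) D: elementwise square
F = E @ np.ones((n, 1))            # m x 1 column of squared associative distances

# Ascending order of F gives the most similar stored sentences first.
order = np.argsort(F.ravel())
```

All m distances come out of one matrix pass, which is what makes the GPU variant one operation instead of m loop iterations.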
As a preferred technical solution of the present invention, the semantic matrix association algorithm is mainly placed on a GPU for computation.
Compared with the prior art, the invention has the following beneficial effects:
1: the invention can convert a section of text or sentence into data which can be stored and calculated by a computer by creating a new sentence meaning algorithm, has more ideality compared with the traditional word meaning calculation, can output similar texts aiming at the input text of a user according to the similar operation among the sentence meanings, realizes the beneficial effect of assisting the text writing, and increases the self-checking and the comparison of the user to the text writing.
2: the invention changes the single-thread operation mode into the matrix operation mode, realizes that the semantic calculation time can be changed from m times to 1 time in a short time through the high-efficiency matrix operation in the GPU, and greatly improves the efficiency of sentence meaning operation.
3: after the matrix operation mode is realized, the cost of deep learning required by single-thread operation can be reduced, and the machine learning efficiency is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the system architecture of the present invention;
FIG. 2 is a schematic flow diagram of the present invention;
FIG. 3 is a schematic diagram of the target text output of the present invention;
Detailed Description
The following description of the preferred embodiments of the present invention is provided for the purpose of illustration and description, and is in no way intended to limit the invention.
Example 1
As shown in FIGS. 1-3, the present invention provides an artificial intelligence auxiliary writing system. The writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module; the word vector semantic module comprises a CBOW model neural network training module; the information processing module comprises an information collecting module, a text box input module and a text box output module; the sentence vector semantic module comprises a sentence vector combination algorithm; and the sentence vector matrix module comprises a semantic matrix association algorithm. The system specifically operates through the following steps:
A. collecting a large number of literary works through the information processing module, segmenting the text and converting it into character strings to form a text paragraph library;
B. processing the text paragraphs acquired in step A through the word vector semantic module: first segmenting the paragraphs into words, then processing the words in turn through the CBOW model neural network training module to obtain the word vector of each word, and combining all word vectors to form a phrase vector library;
C. arranging the phrase vector library from step B in the sentence vector semantic module, and outputting the word vectors as a sentence vector through the sentence vector combination algorithm, so that the sentences of the text paragraphs are mainly expressed through sentence vectors;
D. after each paragraph in the text paragraph library generated in step A passes through steps B and C, obtaining the sentence feature vector of each text paragraph, expressing the sentence feature vector in floating-point form, and combining all sentence feature vectors into a literary work matrix library;
E. a user inputs a target text through the text box input module of the information processing module; after the text is converted into a character string, a target sentence vector is formed through steps B and C;
F. processing the target sentence vector and the literary work matrix library from step D through the semantic matrix association algorithm of the sentence vector matrix module to obtain a set of similar sentence vectors, outputting it to the text box output module of the information processing module, with the similar sentences arranged in ascending order of distance (most similar first).
Furthermore, the information processing module comprises a web crawler or an external web API interface and is mainly used for extracting literary work information.
The CBOW model neural network training module is used under the word2vec bag-of-words algorithm model. Its training process extracts literature sentences from a large corpus as training data; from each sentence a word w(t) is extracted and predicted from its context words w(t-2), w(t-1), w(t+1), w(t+2). The trained module can quantize word strings; training comprises the following steps:
(1) inputting the one-hot encoding of each context word of the current word into the input layer, where the dimension of the one-hot encoding is 1 × V; a matrix W1 of dimension V × N is set, where V is the total number of words contained in the dictionary and N is a user-defined dimension;
(2) multiplying each context word by the same matrix W1 to obtain a 1 × N vector per context word, averaging these into a single 1 × N vector, and finally multiplying the averaged 1 × N vector by a matrix W2 of dimension N × V to obtain a 1 × V vector;
(3) normalizing the 1 × V vector to obtain the probability of each word, taking the word with the maximum probability value as the predicted word w'(t), computing the error between the predicted word w'(t) and the true expected word w(t), and performing back-propagation gradient descent to adjust the values of W1 and W2; the final W1 matrix is the word vector library of the literature sentences.
The sentence vector combination algorithm is computed on top of the CBOW model neural network training module: a sentence vector is formed from the word vectors that module produces. Specifically: suppose the target sentence A contains n words, each represented by an m-dimensional word vector in the word vector library; the set of word vectors of sentence A is X = (X_1, X_2, …, X_n), where each word vector can be written as:
X_1 = [X_11, X_12, …, X_1m]
X_2 = [X_21, X_22, …, X_2m]
……
X_n = [X_n1, X_n2, …, X_nm]
If the semantic feature vector of sentence A is Avec, the algorithm of Avec is:
Avec = [(X_11 + X_21 + … + X_n1)/n, (X_12 + X_22 + … + X_n2)/n, …, (X_1m + X_2m + … + X_nm)/n]
For simplicity of representation, let:
Y_1 = (X_11 + X_21 + … + X_n1)/n
Y_2 = (X_12 + X_22 + … + X_n2)/n
……
Y_m = (X_1m + X_2m + … + X_nm)/n
Then the semantic feature vector of sentence A is Avec = [Y_1, Y_2, …, Y_m], and the sentence vector Avec is obtained, where each Y is a floating-point number. After many sentence vectors are collected, let the total number of sentences be S; the floating-point matrix formed from the sentence vectors is then expressed as:
        | Y_11  Y_12  …  Y_1m |
    G = | Y_21  Y_22  …  Y_2m |   (S rows × m columns)
        |  …     …         …  |
        | Y_S1  Y_S2  …  Y_Sm |
The output matrices are combined to form the literary work matrix library G.
The semantic association algorithm mainly computes over the target text and the literary work matrix library together, uses the Euclidean distance formula, and comprises the following steps:
Let the target text input by the user in step E be the X text; through steps B and C its n-dimensional feature vector X = (X_1, X_2, …, X_n) is obtained (this X denotes the sentence vector of the target text and is distinct from the word vector set X above). Let a comparison sentence vector be Y = (Y_1, Y_2, …, Y_n). The multi-dimensional distance formula is then:
d(X, Y) = √[(X_1 − Y_1)² + (X_2 − Y_2)² + … + (X_n − Y_n)²]
The distances between the X text and the many stored sentence feature vectors are computed by comparing the X text in turn against the millions of sentence feature vectors stored by the program; these distances measure the similarity between sentences, and the similar sentences are finally sorted.
The semantic association algorithm comprises an algorithm simplification process, and comprises the following steps:
First define a transformation matrix C of m rows and one column, with every element equal to 1:
C = [1, 1, …, 1]ᵀ   (m rows × 1 column)
Multiplying the m-row transformation matrix C by the sentence vector of the X text (a 1 × n row) stacks m copies of it:
X' = C·X = | X_1  X_2  …  X_n |
           | X_1  X_2  …  X_n |
           |  …    …       …  |
           | X_1  X_2  …  X_n |   (m rows × n columns)
All memorized sentence feature vectors are combined into the matrix G, which has m rows and n columns: the algorithm has memorized m sentences, and each sentence feature vector has n dimensions;
D = X' − G, where X' is the user's X text, originally a one-row, n-column matrix, converted by the transformation matrix C into the m-row, n-column matrix X'; subtracting G from X' yields the matrix D of differences between the sentence vector of the X text and every sentence vector in the literary work matrix library;
E=D⊙D,
where the operator ⊙ denotes the Hadamard product, a matrix operation: if A = (a_ij) and B = (b_ij) are two matrices of the same order and c_ij = a_ij × b_ij, then the matrix C = (c_ij) is called the Hadamard product (elementwise product) of A and B. In this formula, E is therefore the Hadamard product of D with itself, i.e. every element of D is squared;
Finally F = E·1_n, where 1_n is an n-row, one-column matrix of ones that sums each row of E. The result F is an m-row, one-column matrix whose values measure the similarity between the X text and each sentence; after ascending sorting, the list of sentences most similar to the X text is obtained. The square root of the original Euclidean formula is omitted, since it does not change the ordering, so the final formula for the associative distance is:
F_i = (X_1 − G_i1)² + (X_2 − G_i2)² + … + (X_n − G_in)²,   i = 1, …, m
The semantic matrix association algorithm is mainly placed on a GPU for computation.
Specifically, according to the above description, the present application mainly provides a sentence-meaning algorithm for text. It converts a target text into a sentence vector that a program can identify, combines literary works into a database according to the features of their sentence vectors, and compares a single sentence vector against the literary work database through the association algorithm, thereby finding the most similar sentences. The sentence vector is built on existing word vectors: through the CBOW model neural network training module based on the word2vec algorithm, the semantics of a sentence are converted into the topics, topic weights, and keywords of its phrases; phrases are indexed into sentences according to keywords, forming a phrase matrix labeled by word vectors and combined into the target sentence. According to this phrase matrix, the sentence vector is obtained: each phrase in a sentence is converted into its word vector, and the word meanings of the word vectors are superposed. Each word vector is a 1 × N array, so the superposed sentence matrix can also be converted into a 1 × N array; the meanings of multiple phrases are thus superposed into one sentence vector whose expression is formed by the combination of those phrases. Finally the sentence vectors containing the semantic features are combined into a floating-point matrix, and this matrix and the original texts are stored on a server. This sentence feature matrix is the literary work matrix library; it carries the semantic features of the sentences, so the computer can understand the meaning of an original text through the sentence-meaning matrix, and at output time the corresponding original text is retrieved via the sentence-meaning matrix, achieving the corresponding output effect.
When outputting associations, the text to be associated is first converted into a sentence vector through the steps described above. The basis of the association output is the Euclidean association formula: the distances between the X text and many sentence feature vectors are computed, i.e., the X text is compared in turn against the millions of sentences stored by the program (the sentence feature vectors in the literary work matrix library explained above), and the sentences are then sorted by their similarity to the X text to see which are closest. This reveals which sentences are most similar to the semantics of the X text; the similarity between sentences is obtained accurately, and the output sentences are guaranteed to carry the same meaning. As shown in FIG. 3, the literature sentences output for the input phrase "known sound" need not contain that phrase at all: the matching relationship formed is neither the traditional "keyword matching" nor "character string matching", but output according to the matching relation of sentence vectors. This also shows that the invention can understand and memorize "declarative knowledge" and uses the matrix association algorithm to simulate the human brain's association mechanism, realizing understanding of and association with the user's semantics.
If the number of sentences in the literary work matrix library reaches tens of millions, traversing the sentences one by one takes a long time. To improve the efficiency of sentence-meaning computation, the single-loop association operation is upgraded to the matrix computation mode, and the time required goes from m passes to one. This mode is based on the semantic matrix association algorithm described above: the X text is converted into a matrix of the same dimensions as the literary work matrix library, the difference of the two matrices is taken, and then, according to the formula, the final single-column matrix is obtained by conversion. The data in that matrix are the similarities between the X text and the sentence vectors in the literary work matrix library, each value representing the similarity to one stored sentence vector. Outputting in ascending order quickly produces the result shown in FIG. 3. The adopted matrix association algorithm simulates the human brain's association mechanism, realizes understanding of and association with user semantics, can rapidly extend from a single text to multiple similar and differing semantic texts, and outputs the language of famous literary works collected in the literary work matrix library, so the output sentences carry stronger literary quality and inspire authors during writing. It can be used in many application fields such as drama, literary creation, and self-media writing, and has strong practicability and universality.
Example 2
The method can also be combined with an image recognition algorithm. Patterns in a picture, such as the sun, the sky and fog, are recognized; the phrase text output after recognition is then converted through the steps of embodiment 1 and finally output as corresponding literary sentences, thereby matching text paragraphs to the picture and realizing rapid creation.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. An artificial intelligence auxiliary writing system comprises a writing system, and is characterized in that the writing system comprises an information processing module, a word vector semantic module, a sentence vector semantic module and a sentence vector matrix module, wherein the word vector semantic module comprises a CBOW model neural network training module, the information processing module comprises an information collecting module, a text box input module and a text box output module, the sentence vector semantic module comprises a sentence vector combination algorithm, and the sentence vector matrix module comprises a semantic matrix association algorithm, and specifically comprises the following steps:
A. collecting a large number of literary works through an information processing module, and converting characters into character strings after segmentation to form a character paragraph library;
B. processing the character paragraphs acquired in step A through the word vector semantic module: first segmenting the character paragraphs, then processing the words in turn through the CBOW model neural network training module to obtain a word vector for each word, and then combining all the word vectors to form a phrase vector library;
C. arranging the phrase vector library of step B as a whole in the sentence vector semantic module, and outputting the word vectors as sentence vectors through the sentence vector combination algorithm, so that the sentences of the text paragraphs are mainly expressed through sentence vectors;
D. after each paragraph in the text paragraph library generated in step A passes through steps B and C, obtaining the sentence feature vector of each text paragraph, expressing the feature sentence vector of the sentence in floating point form, and combining all sentence feature vectors to form a literary work matrix library;
E. a user inputs a target text through the text box input module of the information processing module; after the text is converted into a character string, a target sentence vector is formed through step B and step C;
F. processing the target sentence vector and the literary work matrix library of step D through the semantic matrix association algorithm of the sentence vector matrix module to obtain a set of similar sentence vectors, outputting the set to the text box output module of the information processing module, and arranging the similar sentence vectors in ascending order according to similarity.
2. The artificial intelligence auxiliary writing system of claim 1, wherein the information processing module comprises a web crawler or an external web API platform port, mainly used for extracting literary work information.
3. The artificial intelligence auxiliary writing system of claim 1, wherein the CBOW model neural network training module is mainly used under the word2vec bag-of-words algorithm model; its training process is to extract literary sentences from a large number of literary sentences as training data, extract a word w(t) from each sentence, and predict w(t) through the context words w(t-2), w(t-1), w(t+1) and w(t+2); the trained CBOW model neural network training module can vectorize word strings, by the following steps:
(1) inputting the one-hot coding of the context words of the current word into the input layer, where the dimension of the one-hot coding is 1×V; a matrix W1 of dimension V×N is set, where V is the total number of words contained in the dictionary and N is a user-defined dimension;
(2) multiplying each context word by the same matrix W1 to obtain a 1×N vector for each context word, averaging these 1×N vectors into a single 1×N vector, and finally multiplying the averaged 1×N vector by the matrix W2 (of dimension N×V) to obtain a 1×V vector;
(3) normalizing the 1×V vector to obtain the probability of each word, taking the word corresponding to the maximum probability value as the predicted word w(t), calculating the error between the predicted word w(t) and the true expected word w(t), and performing back-propagation gradient descent to adjust the matrix values of W1 and W2; the finally obtained W1 matrix is the word vector library of the literary sentences.
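Steps (1)-(3) can be sketched as a single CBOW forward pass. The vocabulary size, embedding dimension and random weights below are illustrative, untrained values, not the module's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 4                     # vocabulary size, user-defined dimension
W1 = rng.normal(size=(V, N))    # input weights; rows become word vectors
W2 = rng.normal(size=(N, V))    # output weights

context_ids = [1, 2, 4, 5]      # indices of w(t-2), w(t-1), w(t+1), w(t+2)
one_hot = np.eye(V)[context_ids]         # four 1 x V one-hot rows
hidden = (one_hot @ W1).mean(axis=0)     # average of the four 1 x N vectors
scores = hidden @ W2                     # 1 x V score vector
probs = np.exp(scores) / np.exp(scores).sum()  # normalize to probabilities
predicted = int(np.argmax(probs))        # index of the predicted word w(t)
```

Training would compare `predicted` against the true w(t) and back-propagate into W1 and W2; after training, the rows of W1 serve as the word vector library, as step (3) states.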
4. The artificial intelligence auxiliary writing system of claim 2, wherein the sentence vector combination algorithm is calculated on the basis of the CBOW model neural network training module, and sentence vectors are formed from the word vectors obtained by the CBOW model neural network training module by: setting the target sentence A to contain n words, each word represented by an m-dimensional word vector in the word vector library, the set of word vectors contained in sentence A is X = (X1, X2, …, Xn), where each word vector may be represented as:
X1 = [X11, X12, …, X1m]
X2 = [X21, X22, …, X2m]
……
Xn = [Xn1, Xn2, …, Xnm]
if the semantic feature vector of sentence A is Avec, the algorithm of Avec is:
Avec = [(X11 + X21 + … + Xn1)/n, (X12 + X22 + … + Xn2)/n, …, (X1m + X2m + … + Xnm)/n]
for simplicity of representation, let:
Y1 = (X11 + X21 + … + Xn1)/n
Y2 = (X12 + X22 + … + Xn2)/n
……
Ym = (X1m + X2m + … + Xnm)/n
so that the semantic feature vector of sentence A is Avec = [Y1, Y2, …, Ym], and the sentence vector Avec is obtained; the data type of Y is a floating point number, so that after a plurality of sentence vectors are collected, with the total number of sentences set to S, the floating point matrix obtained from the sentence vectors is expressed as:
| Y11  Y12  …  Y1m |
| Y21  Y22  …  Y2m |
|  …    …   …   …  |
| YS1  YS2  …  YSm |
(an S×m matrix, one sentence vector per row)
the output matrixes are combined to form a literary work matrix library G.
5. The artificial intelligence auxiliary writing system according to claim 4, wherein the semantic matrix association algorithm mainly combines the target text and the literary work matrix library for calculation and includes the Euclidean distance formula, comprising the following steps:
setting the target text input by the user in step E as the X text, the n-dimensional feature vector X = (X1, X2, …, Xn) of the X text is obtained through step B and step C; the symbol X here denotes the feature vector set of the target X text and is essentially different from the word vector X mentioned above; with a comparison sentence Y = (Y1, Y2, …, Yn), the multi-dimensional distance formula is:
d(X, Y) = √((X1 − Y1)² + (X2 − Y2)² + … + (Xn − Yn)²)
the distance between the X text and the plurality of sentence feature vectors can be calculated by sequentially calculating the distance between the X text and millions of sentence feature vectors stored by a program, namely the similarity between sentences, and finally sequencing the similar sentences.
6. The artificial intelligence auxiliary writing system as claimed in claim 5, wherein the semantic matrix association algorithm comprises an algorithm simplification process, comprising the following steps:
first, the transformation matrix C of m rows and one column is defined:
C = [1, 1, …, 1]ᵀ (an m×1 column matrix in which every element is 1)
multiplying the m-row transformation matrix C by the sentence vector of the X text yields:
X' = C·X (an m×n matrix in which every row is X1, X2, …, Xn)
combining all the memorized sentence feature vectors into the matrix G, a matrix of m rows and n columns; that is, the algorithm memorizes m sentences, and each sentence feature vector has n dimensions;
D = X' − G, where the X text input by the user is a matrix of one row and n columns, converted into the m-row, n-column matrix X' by the transformation matrix C; subtracting G from X' gives the matrix D, the difference between the sentence vector of the X text and all sentence vectors in the literary work matrix library;
E = D ⊙ D,
where the operator "⊙" denotes the Hadamard product, a matrix operation: if A = (aij) and B = (bij) are two matrices of the same order and cij = aij × bij, then the matrix C = (cij) is called the Hadamard product of A and B, also called the basic product; thus in this formula E is the Hadamard product of the matrix D with itself, that is, every element of D is squared;
finally F = Eᵀ·C, where Eᵀ is the transpose of E and C is the transformation matrix; the obtained F is a matrix of m rows and one column whose values are the similarities between the X text and each sentence, and the list of sentences most similar to the X text is obtained after arranging them in ascending order; the original Euclidean formula no longer requires the square root, since dropping the root does not change the ordering, so the final formula is:
Fi = (X1 − Gi1)² + (X2 − Gi2)² + … + (Xn − Gin)², for i = 1, 2, …, m
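A small check of the simplification: the pipeline D = X' − G, E = D ⊙ D, F = per-row sum of E reproduces the squared Euclidean distances of the claim 5 formula (with the root dropped, which preserves the ascending order). C, G and x below are illustrative values; the row sum is written here with an n×1 ones vector as one way to realize the final column matrix:

```python
import numpy as np

m, n = 3, 4
G = np.arange(m * n, dtype=float).reshape(m, n)  # m memorized sentence vectors
x = np.array([[1.0, 0.0, 2.0, 1.0]])             # 1 x n query (X text) vector
C = np.ones((m, 1))                              # m-row transformation matrix
Xp = C @ x                                       # X': the query repeated m times
D = Xp - G                                       # difference against every sentence
E = D * D                                        # Hadamard product D ⊙ D
F = E @ np.ones((n, 1))                          # m x 1 similarity column matrix

# loop version of claim 5 (squared, no root) for comparison
loop = [sum((x[0, j] - G[i, j]) ** 2 for j in range(n)) for i in range(m)]
```

Both paths yield the same m values, so sorting F in ascending order gives the same similar-sentence list as the per-sentence Euclidean loop.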
7. the system of claim 1, wherein the semantic matrix association algorithm is implemented on a GPU.
CN202011002905.4A 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system Pending CN114254645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002905.4A CN114254645A (en) 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system

Publications (1)

Publication Number Publication Date
CN114254645A true CN114254645A (en) 2022-03-29

Family

ID=80789616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002905.4A Pending CN114254645A (en) 2020-09-22 2020-09-22 Artificial intelligence auxiliary writing system

Country Status (1)

Country Link
CN (1) CN114254645A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312506A (en) * 2023-09-07 2023-12-29 广州风腾网络科技有限公司 Page semantic information extraction method and system
CN117312506B (en) * 2023-09-07 2024-03-08 广州风腾网络科技有限公司 Page semantic information extraction method and system
CN117113977A (en) * 2023-10-09 2023-11-24 北京信诺软通信息技术有限公司 Method, medium and system for identifying text generated by AI contained in test paper
CN117113977B (en) * 2023-10-09 2024-04-16 北京信诺软通信息技术有限公司 Method, medium and system for identifying text generated by AI contained in test paper
CN117057325A (en) * 2023-10-13 2023-11-14 湖北华中电力科技开发有限责任公司 Form filling method and system applied to power grid field and electronic equipment
CN117057325B (en) * 2023-10-13 2024-01-05 湖北华中电力科技开发有限责任公司 Form filling method and system applied to power grid field and electronic equipment

Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN108628935B (en) Question-answering method based on end-to-end memory network
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111027595B (en) Double-stage semantic word vector generation method
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN114254645A (en) Artificial intelligence auxiliary writing system
CN111382565A (en) Multi-label-based emotion-reason pair extraction method and system
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
Tripathy et al. Comprehensive analysis of embeddings and pre-training in NLP
CN111400494B (en) Emotion analysis method based on GCN-Attention
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN114756681B (en) Evaluation and education text fine granularity suggestion mining method based on multi-attention fusion
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Li et al. Multimodal fusion with co-attention mechanism
CN116127954A (en) Dictionary-based new work specialized Chinese knowledge concept extraction method
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Seilsepour et al. Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer
Zhao et al. Commented content classification with deep neural network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination