CN107688621B

CN107688621B - Method and system for optimizing file

Info

Publication number: CN107688621B
Application number: CN201710698292.4A
Authority: CN
Inventors: 刘月明; 梁岚; 李舰
Original assignee: Aim Shanghai Culture Medium Co ltd
Current assignee: Aim Shanghai Culture Medium Co ltd
Priority date: 2017-08-15
Filing date: 2017-08-15
Publication date: 2021-06-29
Anticipated expiration: 2037-08-15
Also published as: CN107688621A

Abstract

The invention discloses a method and a system for optimizing a document, wherein the method comprises the following steps: capturing text in the document to obtain an original text; processing the original text to obtain a plurality of first target texts; receiving a second target text selected by a user from the plurality of first target texts; calculating the heat of each hot word in each first target text according to Newton's law of cooling; calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model; and displaying corresponding hot word recommendation texts to the user according to the similarity. The effect is that the user can optimize his or her own copy through a simple replacement operation, which reduces the user's workload and improves the user's working efficiency.

Description

Method and system for optimizing file

Technical Field

The invention belongs to the technical field of computer text information, and particularly relates to a method and a system for optimizing a file.

Background

A good document must attend to tone and emotional expression: a passage, a sentence or even a single word can make an audience resonate with it or feel well disposed towards it. New words and hot words emerge every day, and a well-chosen word or sentence can become a highlight of a document. In traditional document optimization, recent hot topics or popular phrases are found mainly through manual collection or a search engine, so the workload is large, the working efficiency is low, and the needs of creators cannot be met.

Disclosure of Invention

To solve the above problems, the present invention provides a method and a system for optimizing a document, so as to overcome the large workload and low working efficiency of the prior art.

In one technical solution adopted by the invention, a method for optimizing a document comprises the following steps:

capturing text in the document to obtain an original text;

processing the original text to obtain a plurality of first target texts;

receiving a second target text selected by a user from the plurality of first target texts;

calculating the heat of each hot word in each first target text according to Newton's law of cooling;

calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;

and displaying a corresponding hot word recommendation text to the user according to the similarity.

Preferably, the text in the document is captured using crawler technology.

Preferably, the original text is segmented into words using jieba.

Preferably, the heat of each hot word in each target text is calculated using the formula:

$$T'(t) = -k\,(T(t) - H)$$

where T'(t) represents the rate of temperature change, the negative sign represents cooling, k represents the cooling coefficient with k > 0, T(t) represents the temperature T as a function of time t, and H represents the room temperature.

Preferably, the calculating the similarity between the remaining first target text and the second target text according to the preset word2vec model specifically includes:

preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;

extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;

and inputting the feature data and a second target text selected by a user into the word2vec model for similarity calculation.

Preferably, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.

In another technical solution adopted by the invention, a system for optimizing a document comprises an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;

the extraction unit is used for capturing text in the document to obtain an original text;

the preprocessing unit is used for processing the original text to obtain a plurality of first target texts;

the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;

the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the corresponding heat degree of each hot word in each first target text according to Newton's cooling law, and the second calculating unit is used for calculating the similarity between the remaining first target text and the second target text according to a preset word2vec model;

and the display unit is used for displaying the corresponding hot word recommendation text to the user according to the similarity.

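Preferably, the second calculating unit is specifically used for:

preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;

extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;

and inputting the feature data and the second target text into the word2vec model for similarity calculation.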

Compared with the prior art, the above technical solution has the following advantages: the text in the document is segmented into words to obtain a plurality of first target texts; the heat of the hot words in each first target text is calculated using Newton's law of cooling; the similarity between the remaining first target texts and the second target text is calculated according to a word2vec model; and the hot word recommendation texts are sorted by similarity and recommended to the user. The user can then optimize his or her own document through a simple replacement operation, which reduces the user's workload and improves the user's working efficiency.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a block diagram of the system of the present invention.

Detailed Description

In order to make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments. The description herein does not mean that all subject matter corresponding to the specific examples set forth in the embodiments is recited in the claims.

Referring to fig. 1, a method for optimizing a document includes the following steps:

S101, capturing text in the document to obtain an original text;

specifically, crawler technology is used to capture the text in the document, which avoids having to obtain the text through manual search and improves the user's working efficiency. In practical applications, the text can also be obtained by purchase. The text can be a sentence, a paragraph or a chapter.
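As an illustration of the capture step, the following is a minimal sketch using only the Python standard library. The source does not specify which crawler is used or which pages are crawled; the names `TextExtractor`, `extract_text` and `fetch_html` are hypothetical, and the sample HTML string is invented for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text chunks of an HTML page as the original text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def fetch_html(url):
    """Download a page (hypothetical helper; not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Invented sample page standing in for a crawled document.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><p>I will learn to forget you</p><script>var x=1;</script></body></html>"
)
original_text = extract_text(sample)
```

Each collected chunk may correspond to a sentence, a paragraph or a chapter, depending on how the source page is structured.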

S102, processing the original text to obtain a plurality of first target texts;

specifically, the original text is segmented into words using jieba, and some meaningless words can be removed through this processing. The processing is as follows: first, word segmentation and part-of-speech tagging are performed, and words with specified parts of speech are taken as candidate words; then the TF-IDF value of each candidate word is calculated, the candidate words are sorted in descending order of TF-IDF value, a specified number of words are output as possible keywords, and the keywords are taken as the first target texts.

The term frequency (TF) is calculated as:

$$\mathrm{TF} = \frac{N}{M}$$

where N represents the number of times the feature term appears in the text, and M represents the total number of words in the text;

the inverse document frequency (IDF) is calculated as:

$$\mathrm{IDF} = \log\frac{D}{D_w}$$

where D represents the total number of texts and D_w represents the number of texts in which the keyword appears.
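The keyword selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the texts have already been segmented into tokens (for example by jieba) and filtered by part of speech, it uses the natural logarithm because the source does not state the base, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """TF = N / M: occurrences of the term over the total word count."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """IDF = log(D / D_w): total texts over texts containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(corpus) / containing)


def top_keywords(doc_tokens, corpus, n):
    """Sort candidate words by TF-IDF in descending order and keep the top n."""
    scores = {t: tf(t, doc_tokens) * idf(t, corpus) for t in set(doc_tokens)}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Toy corpus of pre-segmented texts (invented for illustration).
corpus = [
    ["i", "will", "learn", "to", "forget", "you"],
    ["i", "will", "see", "you"],
    ["learn", "to", "code"],
]
first_target_texts = top_keywords(corpus[0], corpus, 1)
```

In this toy corpus, "forget" appears in only one of the three texts, so it receives the highest IDF and is selected first.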

S103, receiving a second target text selected by a user from the plurality of first target texts;

specifically, the user selects the corresponding word to be optimized, which is convenient for the user to operate.

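S104, calculating the heat of each hot word in each first target text according to Newton's law of cooling;

specifically, the formula is adopted:

$$T'(t) = -k\,(T(t) - H)$$

where T'(t) represents the rate of temperature change, the negative sign represents cooling, k represents the cooling coefficient with k > 0, T(t) represents the temperature T as a function of time t, and H represents the room temperature.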
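Newton's law of cooling, T'(t) = -k (T(t) - H), has the closed-form solution T(t) = H + (T(0) - H) e^{-kt}, which gives one possible way to evaluate the heat of a hot word after an elapsed time. The sketch below assumes this closed form is used directly; the initial heat values, the time unit (hours), the value of k and the default room temperature H = 0 are illustrative assumptions that do not come from the source.

```python
import math


def cooled_heat(initial_heat, elapsed, k, ambient=0.0):
    """Solution of T'(t) = -k (T(t) - H): T(t) = H + (T0 - H) * exp(-k t)."""
    if k <= 0:
        raise ValueError("cooling coefficient k must be positive")
    return ambient + (initial_heat - ambient) * math.exp(-k * elapsed)


# Hypothetical hot words: (initial heat, hours since the heat was measured).
hot_words = {
    "word_a": (100.0, 0.0),
    "word_b": (100.0, 24.0),
    "word_c": (40.0, 2.0),
}
K = 0.05  # assumed cooling coefficient per hour

heat = {w: cooled_heat(h0, hours, K) for w, (h0, hours) in hot_words.items()}
ranked_by_heat = sorted(heat, key=heat.get, reverse=True)
```

With these values, a word that was very hot a day ago ("word_b") ranks below a moderately hot but recent word ("word_c"), which is the behaviour the cooling model is meant to capture.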

S105, calculating the similarity between the remaining first target text and the second target text according to a preset word2vec model;

specifically, the remaining first target texts are preprocessed and then input into a word2vec model, which is trained to obtain multi-dimensional word vectors;

feature items of the multi-dimensional word vectors are extracted to obtain corresponding feature data;

and the feature data and the second target text are input into the word2vec model for similarity calculation.
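The source does not state which similarity measure is applied to the word vectors. Cosine similarity is a common choice with word2vec embeddings, and the following sketch assumes it. The three-dimensional vectors are invented toy values standing in for the output of a trained model, and the names `cosine_similarity` and `vectors` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


# Toy word vectors (assumed values, not produced by a real word2vec model).
vectors = {
    "I will": [1.0, 0.0, 1.0],
    "I": [1.0, 0.0, 0.8],
    "want": [0.2, 1.0, 0.1],
    "forget": [-1.0, 0.5, 0.0],
}

selected = "I will"  # the second target text chosen by the user
similarity = {
    word: cosine_similarity(vectors[selected], vec)
    for word, vec in vectors.items()
    if word != selected
}
```

The resulting scores lie in the range [-1, 1], with larger values indicating that a candidate is closer to the selected text in the vector space.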

S106, displaying the corresponding hot word recommendation texts to the user according to the similarity.

Further, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.
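A minimal sketch of such a preprocessing filter is given below. The stop word list and the set of ASCII emoticons are small hypothetical samples, and treating Unicode category "So" (other symbols) as emoji is a simplifying assumption; the source does not specify how these items are detected.

```python
import unicodedata

STOP_WORDS = {"the", "a", "of", "的", "了"}      # hypothetical sample list
ASCII_EMOTICONS = {":)", ":(", ":D", "^_^"}     # hypothetical sample list


def is_punctuation(token):
    """True if every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def is_emoticon(token):
    """True for listed ASCII emoticons or tokens made only of 'So' symbols."""
    if token in ASCII_EMOTICONS:
        return True
    return all(unicodedata.category(ch) == "So" for ch in token)


def preprocess(tokens):
    """Drop empty tokens, stop words, punctuation marks and emoticons."""
    return [
        t for t in tokens
        if t
        and t.lower() not in STOP_WORDS
        and not is_punctuation(t)
        and not is_emoticon(t)
    ]


cleaned = preprocess(["I", "will", ",", "forget", "the", "😀", ":D", "you", "!"])
```

Only the tokens that carry content remain after filtering, which keeps stop words and symbols from influencing the trained word vectors.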

With the above scheme, the text in the document is segmented into words to obtain a plurality of first target texts, the heat of the hot words in each first target text is calculated using Newton's law of cooling, the similarity between the remaining first target texts and the second target text is calculated according to a word2vec model, the results are sorted by similarity, and the corresponding hot word recommendation texts are recommended to the user.

For example, the original text 'I will learn to forget you' is segmented to obtain a plurality of first target texts such as 'I will', 'learning to' and 'forget'. The user then selects 'I will' as the text to be adjusted, so that 'I will' becomes the second target text. Corresponding hot word recommendation texts such as 'I', 'you', 'your' and 'want' are obtained by calculation, and the user can select a hot word as a replacement according to his or her own needs. The original text is thereby optimized, which reduces the user's workload and improves the user's working efficiency.

Referring to fig. 2, a system for optimizing a document includes an extracting unit, a preprocessing unit, a receiving unit, a processing unit, and a display unit;

Specifically, the recommended hot words can be displayed in descending order, and the user can select a recommended hot word as a replacement according to his or her needs, thereby optimizing the document.
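The descending-order display and the replacement operation can be illustrated with a short sketch. The similarity scores are invented for the example, and the function names `recommend` and `replace_once` are hypothetical.

```python
def recommend(similarity, top_n=3):
    """Return the top_n candidate words sorted by descending similarity."""
    ranked = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]


def replace_once(text, target, replacement):
    """Replace the first occurrence of the selected text with the chosen word."""
    return text.replace(target, replacement, 1)


# Invented similarity scores for the candidate hot words.
scores = {"want": 0.21, "I": 0.99, "your": 0.62, "you": 0.75}

recommendations = recommend(scores, top_n=3)
optimized = replace_once("I will learn to forget you", "I will", recommendations[0])
```

Here the user is assumed to accept the first recommendation; in practice any displayed word could be chosen.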

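Further, in order to calculate the similarity accurately and obtain hot word recommendation texts with high relevance, the second calculating unit is specifically used for: preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors; extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data; and inputting the feature data and the second target text into the word2vec model for similarity calculation.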

Finally, while the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A method for optimizing a document, comprising the steps of:

capturing text in the document to obtain an original text;

processing the original text to obtain a plurality of first target texts, namely, firstly, performing word segmentation processing on the original text, and taking words meeting specified parts of speech as candidate words; then calculating the TF-IDF value of each candidate word, arranging the TF-IDF values of each candidate word in a descending order, outputting a specified number of words as a keyword, and taking the keyword as a first target text;

receiving a second target text selected by a user from the plurality of first target texts, wherein the user selects corresponding words needing to be optimized;

calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model, which specifically comprises the following steps: preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;

inputting the feature data and a second target text selected by a user into the word2vec model for similarity calculation;

2. The method of claim 1, wherein the text in the document is captured by a crawler technique.

3. The method for optimizing a document according to claim 1, wherein jieba is used to perform word segmentation on the original text.

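4. The method of claim 1, wherein the heat of each hot word in each target text is calculated using the formula:

$$T'(t) = -k\,(T(t) - H)$$

where T'(t) represents the rate of temperature change, the negative sign represents cooling, k represents the cooling coefficient with k > 0, T(t) represents the temperature T as a function of time t, and H represents the room temperature.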

5. The method of claim 4, wherein the preprocessing includes filtering stop words, filtering punctuation marks, and filtering emoticons.

6. A system for optimizing a document, characterized by comprising an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;

the preprocessing unit is used for processing the original text to obtain a plurality of first target texts, namely, firstly, performing word segmentation processing on the original text, and taking words meeting the specified part of speech as candidate words; then calculating the TF-IDF value of each candidate word, arranging the TF-IDF values of each candidate word in a descending order, outputting a specified number of words as a keyword, and taking the keyword as a first target text;

the display unit is used for displaying corresponding hot word recommendation texts to a user according to the similarity;

the second computing unit specifically includes:

7. The system of claim 6, wherein the preprocessing comprises filtering stop words, filtering punctuation marks, and filtering emoticons.