CN107688621B - Method and system for optimizing file - Google Patents
- Publication number
- CN107688621B (application CN201710698292.4A; also published as CN107688621A)
- Authority
- CN
- China
- Prior art keywords
- text
- target
- texts
- calculating
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a system for optimizing a document. The method comprises the following steps: capturing text in the document to obtain an original text; processing the original text to obtain a plurality of first target texts; receiving a second target text selected by a user from the plurality of first target texts; calculating the heat of each hot word in each first target text according to Newton's law of cooling; calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model; and displaying corresponding hot-word recommendation texts to the user according to the similarity. As a result, the user can optimize their own copy through a simple replacement operation, which reduces the user's workload and improves working efficiency.
Description
Technical Field
The invention belongs to the technical field of computer text information processing, and particularly relates to a method and a system for optimizing a document.
Background
A good document must attend to tone and emotional expression: a passage, a sentence, or even a single word can make the audience resonate with it or respond favorably. New words and hot words emerge every day, and such words and phrases can become a highlight of a document. In conventional document optimization, recent hot topics or popular phrases are found mainly by manual collection or through a search engine, so the workload is large, working efficiency is low, and creators' needs are not met.
Disclosure of Invention
To solve the above problems, the present invention provides a method and a system for optimizing a document, so as to overcome the large workload and low working efficiency of the prior art.
In one technical solution, the invention provides a method for optimizing a document, comprising the following steps:
capturing text in the document to obtain an original text;
processing the original text to obtain a plurality of first target texts;
receiving a second target text selected by a user from the plurality of first target texts;
calculating the heat of each hot word in each first target text according to Newton's law of cooling;
calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and displaying corresponding hot-word recommendation texts to the user according to the similarity.
Preferably, the text in the document is captured using crawler technology.
Preferably, the original text is segmented into words using jieba.
Preferably, the heat of each hot word in each target text is calculated using the formula
$T'(t) = -k\,(T(t) - H)$,
where $T'(t)$ is the rate of temperature change, the negative sign indicates cooling, $k$ is the cooling coefficient with $k > 0$, $T(t)$ is the temperature $T$ as a function of time $t$, and $H$ is the room temperature.
Preferably, calculating the similarity between the remaining first target texts and the second target text according to the preset word2vec model specifically includes:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation.
Preferably, the preprocessing specifically includes filtering stop words, punctuation marks and emoticons.
In another technical solution, the invention provides a system for optimizing a document, comprising an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the heat of each hot word in each first target text according to Newton's law of cooling, and the second calculating unit is used for calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and the display unit is used for displaying the corresponding hot-word recommendation texts to the user according to the similarity.
Preferably, the second calculating unit is specifically used for:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text into the word2vec model for similarity calculation.
Preferably, the preprocessing specifically includes filtering stop words, punctuation marks and emoticons.
Compared with the prior art, the above technical solution has the following advantages: the text in the document is segmented into words to obtain a plurality of first target texts; the heat of the hot words in each first target text is calculated using Newton's law of cooling; the similarity between the remaining first target texts and the second target text is calculated according to a word2vec model; and the hot-word recommendation texts are sorted by similarity and recommended to the user. The user can then optimize their own document through a simple replacement operation, which reduces the user's workload and improves working efficiency.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. The description does not imply that all subject matter of the specific examples set forth in the embodiments is recited in the claims.
Referring to FIG. 1, a method for optimizing a document includes the following steps:
S101, capturing text in the document to obtain an original text;
specifically, crawler technology is used to capture the text in the document, which avoids obtaining the text through manual search and improves the user's working efficiency. In practical applications, the text can also be obtained by purchase. The text can be a sentence, a paragraph or a chapter.
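The source does not describe the crawler itself. As a minimal sketch of the text-capture part only, the following assumes the page has already been downloaded as an HTML string and uses Python's standard `html.parser` to collect visible text. The names `TextExtractor` and `extract_text` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self.chunks: List[str] = []   # captured text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> List[str]:
    """Return the visible text fragments of an HTML page (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each returned fragment could then serve as one original text (a sentence, paragraph or chapter) for the later steps. Fetching pages, respecting site policies, and choosing which pages to crawl are outside this sketch.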
S102, processing the original text to obtain a plurality of first target texts;
specifically, the original text is segmented into words using jieba, and meaningless words can be removed by this processing. The steps are as follows: first, word segmentation and part-of-speech tagging are performed, and words with the specified parts of speech are taken as candidate words; then the TF-IDF value of each candidate word is calculated, the candidate words are sorted in descending order of TF-IDF value, and a specified number of words are output as possible keywords, which serve as the first target texts.
The term frequency (TF) is calculated as:
$\mathrm{TF} = N / M$, where $N$ is the number of times the feature item (word) appears in the text and $M$ is the total number of words in the text;
the inverse document frequency (IDF) is calculated as:
$\mathrm{IDF} = \log(D / D_w)$, where $D$ is the total number of texts and $D_w$ is the number of texts in which the keyword appears.
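The keyword-selection step can be sketched as follows, using the TF and IDF definitions given above. This is an illustration under assumptions: the texts are taken to be already segmented and filtered by part of speech (jieba is not called here), the natural logarithm is used because the source does not state a base, and the function names `tf_idf` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of `doc` with TF * IDF, where TF = N/M and IDF = log(D/Dw)."""
    counts = Counter(doc)
    total_words = len(doc)    # M: number of words in this text
    total_docs = len(corpus)  # D: number of texts in the corpus
    scores: Dict[str, float] = {}
    for word, occurrences in counts.items():
        # Dw: number of texts that contain the word
        docs_with_word = sum(1 for other in corpus if word in other)
        if docs_with_word == 0:
            continue  # word never seen in the corpus; IDF is undefined
        tf = occurrences / total_words
        idf = math.log(total_docs / docs_with_word)
        scores[word] = tf * idf
    return scores


def top_keywords(doc: List[str], corpus: List[List[str]], limit: int) -> List[str]:
    """Return up to `limit` candidate words in descending order of TF-IDF."""
    scores = tf_idf(doc, corpus)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:limit]
```

A word that occurs in every text gets an IDF of zero and therefore sinks to the bottom of the ranking, which is why common words rarely survive as keywords.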
S103, receiving a second target text selected by a user from the plurality of first target texts;
specifically, the user selects the corresponding word to be optimized, which is convenient for the user to operate.
S104, calculating the heat of each hot word in each first target text according to Newton's law of cooling;
specifically, the heat of each hot word in each target text is calculated using the formula
$T'(t) = -k\,(T(t) - H)$,
where $T'(t)$ is the rate of temperature change, the negative sign indicates cooling, $k$ is the cooling coefficient with $k > 0$, $T(t)$ is the temperature $T$ as a function of time $t$, and $H$ is the room temperature.
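The source gives only the differential equation. Solving it with an initial value $T(0) = T_0$ yields the closed form $T(t) = H + (T_0 - H)\,e^{-kt}$, which the sketch below evaluates. How a hot word's initial heat $T_0$ and the coefficient $k$ are chosen is not stated in the source, so the parameters and the name `cooled_heat` are assumptions for illustration.

```python
import math


def cooled_heat(initial_heat: float, elapsed: float, k: float, ambient: float = 0.0) -> float:
    """Heat after `elapsed` time units under Newton's law of cooling.

    Evaluates T(t) = H + (T0 - H) * exp(-k * t), the solution of
    T'(t) = -k * (T(t) - H) with T(0) = T0.
    `ambient` plays the role of the room temperature H (assumed 0 by default).
    """
    if k <= 0:
        raise ValueError("cooling coefficient k must be positive")
    return ambient + (initial_heat - ambient) * math.exp(-k * elapsed)
```

With `ambient` at zero, the heat halves every `math.log(2) / k` time units, so a larger `k` makes older hot words fade from the ranking faster.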
S105, calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
specifically, the remaining first target texts are preprocessed and then input into a word2vec model, which is trained to obtain multi-dimensional word vectors;
feature items are extracted from the multi-dimensional word vectors to obtain corresponding feature data;
and the feature data and the second target text are input into the word2vec model for similarity calculation.
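The source does not say which similarity measure the word2vec model applies. Cosine similarity between word vectors is the measure commonly used with word2vec, so the sketch below assumes it; the vectors are taken as already produced by a trained model, and `cosine_similarity` is a hypothetical helper rather than the patent's own routine.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

The value ranges from -1 to 1, and a higher value means the candidate word is closer in meaning to the word the user selected.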
S106, displaying the corresponding hot-word recommendation texts to the user according to the similarity.
Further, the preprocessing specifically includes filtering stop words, punctuation marks and emoticons.
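The three filters can be sketched on an already segmented token list. This is an assumption-laden illustration: the stop-word set is supplied by the caller, and punctuation and emoticons are detected through Unicode general categories (P* for punctuation, S* for symbols, which covers most emoji). The helper names are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Set

# Joiner and variation selector that appear inside some emoji sequences.
_EMOJI_GLUE = {"\u200d", "\ufe0f"}


def is_symbolic(token: str) -> bool:
    """True if the token consists only of punctuation, symbols or emoji glue."""
    return all(
        ch in _EMOJI_GLUE or unicodedata.category(ch)[0] in ("P", "S")
        for ch in token
    )


def preprocess(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop empty tokens, stop words, punctuation marks and emoticons."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        if token in stop_words or is_symbolic(token):
            continue
        kept.append(token)
    return kept
```

Text emoticons such as `:)` are made of punctuation characters, so the same check removes them along with graphical emoji.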
With this scheme, the text in the document is segmented into words to obtain a plurality of first target texts, the heat of the hot words in each first target text is calculated using Newton's law of cooling, the similarity between the remaining first target texts and the second target text is calculated according to a word2vec model, the results are sorted by similarity, and the corresponding hot-word recommendation texts are recommended to the user.
For example, the original text 'I will learn to forget you' is segmented to obtain 'I will', 'learning to' and 'forget' as the plurality of first target texts. The user then selects 'I will' as the text to be adjusted, which makes it the second target text. Hot-word recommendation texts such as 'I', 'you', 'your' and 'want' are obtained by calculation, and the user can choose among them to replace the selected text as needed. The original text is thereby optimized, the user's workload is reduced, and the user's working efficiency is improved.
Referring to FIG. 2, a system for optimizing a document includes an extraction unit, a preprocessing unit, a receiving unit, a processing unit, and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the heat of each hot word in each first target text according to Newton's law of cooling, and the second calculating unit is used for calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and the display unit is used for displaying the corresponding hot-word recommendation texts to the user according to the similarity.
Specifically, the recommended hot words can be displayed in descending order of similarity, and the user can select one of them to replace the chosen text as needed, thereby optimizing the document.
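The descending-order display can be sketched as a sort over precomputed similarity scores. The mapping from candidate word to score, the `limit` parameter, and the name `rank_recommendations` are assumptions for illustration; the source does not say how many recommendations are shown.

```python
from typing import Dict, List, Tuple


def rank_recommendations(similarities: Dict[str, float], limit: int) -> List[Tuple[str, float]]:
    """Return up to `limit` (word, similarity) pairs, most similar first."""
    ranked = sorted(
        similarities.items(),
        key=lambda item: (-item[1], item[0]),  # ties broken alphabetically
    )
    return ranked[:limit]
```

The display unit would then present the returned words in order, leaving the final replacement choice to the user.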
Further, to calculate the similarity accurately and obtain highly relevant hot-word recommendation texts, the second calculating unit is specifically used for:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation.
Further, the preprocessing specifically includes filtering stop words, punctuation marks and emoticons.
Finally, while the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (7)
1. A method for optimizing a document, comprising the steps of:
capturing text in the document to obtain an original text;
processing the original text to obtain a plurality of first target texts, namely, firstly, performing word segmentation processing on the original text, and taking words meeting specified parts of speech as candidate words; then calculating the TF-IDF value of each candidate word, arranging the TF-IDF values of each candidate word in a descending order, outputting a specified number of words as a keyword, and taking the keyword as a first target text;
receiving a second target text selected by a user from the plurality of first target texts, wherein the user selects corresponding words needing to be optimized;
calculating the heat of each hot word in each first target text according to Newton's law of cooling;
calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model, which specifically comprises the following steps: preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
inputting the feature data and a second target text selected by a user into the word2vec model for similarity calculation;
and displaying a corresponding hot word recommendation text to the user according to the similarity.
2. The method of claim 1, wherein the text in the document is captured by a crawler technique.
3. The method for optimizing a document according to claim 1, wherein a jieba mode is adopted to perform word segmentation on the original text.
4. The method of claim 1, wherein the heat of each hot word in each target text is calculated using the formula
$T'(t) = -k\,(T(t) - H)$,
where $T'(t)$ is the rate of temperature change, the negative sign indicates cooling, $k$ is the cooling coefficient with $k > 0$, $T(t)$ is the temperature $T$ as a function of time $t$, and $H$ is the room temperature.
5. The method of claim 4, wherein the preprocessing includes filtering stop words, filtering punctuation marks, and filtering emoticons.
6. A system for optimizing a document, characterized by comprising an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts, namely, firstly, performing word segmentation processing on the original text, and taking words meeting the specified part of speech as candidate words; then calculating the TF-IDF value of each candidate word, arranging the TF-IDF values of each candidate word in a descending order, outputting a specified number of words as a keyword, and taking the keyword as a first target text;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the corresponding heat degree of each hot word in each first target text according to Newton's cooling law, and the second calculating unit is used for calculating the similarity between the remaining first target text and the second target text according to a preset word2vec model;
the display unit is used for displaying corresponding hot word recommendation texts to a user according to the similarity;
the second computing unit specifically includes:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and a second target text selected by a user into the word2vec model for similarity calculation.
7. The system of claim 6, wherein the preprocessing comprises filtering stop words, filtering punctuation marks, and filtering emoticons.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710698292.4A CN107688621B (en) | 2017-08-15 | 2017-08-15 | Method and system for optimizing file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710698292.4A CN107688621B (en) | 2017-08-15 | 2017-08-15 | Method and system for optimizing file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107688621A CN107688621A (en) | 2018-02-13 |
CN107688621B CN107688621B (en) | 2021-06-29 |
Family
ID=61153398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710698292.4A Active CN107688621B (en) | 2017-08-15 | 2017-08-15 | Method and system for optimizing file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107688621B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509417B (en) * | 2018-03-20 | 2022-03-15 | 腾讯科技(深圳)有限公司 | Title generation method and device, storage medium and server |
CN111340551A (en) * | 2020-02-27 | 2020-06-26 | 广东博智林机器人有限公司 | Method, device, terminal and storage medium for generating advertisement content |
CN112015975B (en) * | 2020-07-15 | 2023-11-14 | 北京淇瑀信息科技有限公司 | Information pushing method and device for financial users based on Newton's law of cooling |
CN113572753B (en) * | 2021-07-16 | 2023-03-14 | 北京淇瑀信息科技有限公司 | User equipment authentication method and device based on Newton's cooling law |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8312022B2 (en) * | 2008-03-21 | 2012-11-13 | Ramp Holdings, Inc. | Search engine optimization |
CN103377232A (en) * | 2012-04-25 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Headline keyword recommendation method and system |
CN104636334A (en) * | 2013-11-06 | 2015-05-20 | 阿里巴巴集团控股有限公司 | Keyword recommending method and device |
CN106649536A (en) * | 2016-11-01 | 2017-05-10 | 四川用联信息技术有限公司 | Achievement of optimization of search engine keywords based on improved k Means algorithm |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7693830B2 (en) * | 2005-08-10 | 2010-04-06 | Google Inc. | Programmable search engine |
Non-Patent Citations (1)
Title |
---|
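Research on New Word Recognition and Hot Word Ranking Methods (新词识别和热词排名方法研究); Geng Shenghua (耿升华); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2014-03-15 (No. 3); main text, Section 5.2 * |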
Also Published As
Publication number | Publication date |
---|---|
CN107688621A (en) | 2018-02-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107688621B (en) | Method and system for optimizing file | |
CN106874292B (en) | Topic processing method and device | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN111797214A (en) | FAQ database-based problem screening method and device, computer equipment and medium | |
JP6394388B2 (en) | Synonym relation determination device, synonym relation determination method, and program thereof | |
CN112667794A (en) | Intelligent question-answer matching method and system based on twin network BERT model | |
CN110147425B (en) | Keyword extraction method and device, computer equipment and storage medium | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN111190997A (en) | Question-answering system implementation method using neural network and machine learning sequencing algorithm | |
CN108319734A (en) | A kind of product feature structure tree method for auto constructing based on linear combiner | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN111309864B (en) | User group emotional tendency migration dynamic analysis method for microblog hot topics | |
Heuer | Text comparison using word vector representations and dimensionality reduction | |
CN111552773A (en) | Method and system for searching key sentence of question or not in reading and understanding task | |
CN112182145A (en) | Text similarity determination method, device, equipment and storage medium | |
Hu et al. | Text sentiment analysis: A review | |
CN113627797B (en) | Method, device, computer equipment and storage medium for generating staff member portrait | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN114141384A (en) | Method, apparatus and medium for retrieving medical data | |
CN114138969A (en) | Text processing method and device | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
CN116932736A (en) | Patent recommendation method based on combination of user requirements and inverted list | |
Marusenko et al. | Mathematical methods for attributing literary works when solving the “Corneille–Molière” problem | |
CN115130453A (en) | Interactive information generation method and device | |
CN114417863A (en) | Word weight generation model training method and device and word weight generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||