CN107688621B - Method and system for optimizing file - Google Patents

Method and system for optimizing file

Info

Publication number
CN107688621B
Authority
CN
China
Prior art keywords
text
target
texts
calculating
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710698292.4A
Other languages
Chinese (zh)
Other versions
CN107688621A (en)
Inventor
刘月明
梁岚
李舰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aim Shanghai Culture Medium Co ltd
Original Assignee
Aim Shanghai Culture Medium Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aim Shanghai Culture Medium Co ltd
Priority to CN201710698292.4A
Publication of CN107688621A
Application granted
Publication of CN107688621B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for optimizing a document, wherein the method comprises the following steps: capturing text in the document to obtain an original text; processing the original text to obtain a plurality of first target texts; receiving a second target text selected by a user from the plurality of first target texts; calculating the heat of each hot word in each first target text according to Newton's law of cooling; calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model; and displaying corresponding recommended hot-word texts to the user according to the similarity. The effect is that users can optimize their own copy through a simple replacement operation, which reduces their workload and improves their working efficiency.

Description

Method and system for optimizing file
Technical Field
The invention belongs to the technical field of computer text information processing, and particularly relates to a method and a system for optimizing a document.
Background
A good document must attend to tone and emotional expression: a passage, a sentence or even a single word can make the audience resonate or feel well disposed, and that is what makes the document good. New words and hot words emerge every day, and such words and sentences can become a major highlight of a document. In traditional document optimization, recent hot topics or popular phrases are mostly found by manual collection or through a search engine, so the workload is large, the working efficiency is low, and the requirements of creators cannot be met.
Disclosure of Invention
To solve the above problems, the present invention provides a method and a system for optimizing a document, which overcome the large workload and low working efficiency of the prior art.
One technical solution adopted by the invention is a method for optimizing a document, comprising the following steps:
capturing text in the document to obtain an original text;
processing the original text to obtain a plurality of first target texts;
receiving a second target text selected by a user from the plurality of first target texts;
calculating the heat of each hot word in each first target text according to Newton's law of cooling;
calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and displaying corresponding recommended hot-word texts to the user according to the similarity.
Preferably, the text in the document is captured using crawler technology.
Preferably, the original text is segmented into words using jieba.
Preferably, the heat of each hot word in each target text is calculated with the formula:
$T'(t) = -k\,(T(t) - H)$, where $T'(t)$ represents the rate of temperature change, the negative sign represents cooling, $k$ represents the cooling coefficient with $k > 0$, $T(t)$ represents the temperature $T$ as a function of time $t$, and $H$ represents room temperature.
Preferably, calculating the similarity between the remaining first target texts and the second target text according to the preset word2vec model specifically includes:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation.
Preferably, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.
Another technical solution adopted by the invention is a system for optimizing a document, comprising an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the heat of each hot word in each first target text according to Newton's law of cooling, and the second calculating unit is used for calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and the display unit is used for displaying the corresponding recommended hot-word texts to the user according to the similarity.
Preferably, the second calculating unit is specifically used for:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text into the word2vec model for similarity calculation.
Preferably, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.
With the above technical solution, the invention has the following advantages over the prior art: the text in the document is segmented into words to obtain a plurality of first target texts; the hot-word heat of each first target text is calculated using Newton's law of cooling; the similarity between the remaining first target texts and the second target text is calculated with a word2vec model; and the recommended hot-word texts are ranked by similarity and recommended to the user. The user can then optimize their own copy through a simple replacement operation, which reduces the user's workload and improves working efficiency.
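As a rough illustration of how Newton's law of cooling could yield a heat score, the sketch below uses the closed-form solution of T'(t) = -k (T(t) - H), which is T(t) = H + (T(0) - H) e^(-kt). The function name `heat_at`, the use of hours as the time unit, the default room temperature of 0 and all sample values are assumptions made for illustration; the patent does not state how the initial heat or the coefficient k is chosen.

```python
import math

def heat_at(initial_heat: float, k: float, hours: float, room: float = 0.0) -> float:
    """Closed-form solution of T'(t) = -k (T(t) - H):
    T(t) = H + (T0 - H) * exp(-k * t)."""
    if k <= 0:
        raise ValueError("cooling coefficient k must be positive")
    return room + (initial_heat - room) * math.exp(-k * hours)

# Hypothetical hot words: (assumed initial heat, age in hours).
hot_words = {
    "word_a": (100.0, 0.0),
    "word_b": (100.0, 24.0),
    "word_c": (60.0, 2.0),
}
k = 0.05  # assumed cooling coefficient per hour

current = {w: heat_at(h0, k, age) for w, (h0, age) in hot_words.items()}
ranked = sorted(current, key=current.get, reverse=True)
print(ranked)
```

With these made-up inputs an older word cools well below a fresher one that started with less heat, which is the behaviour the cooling model is meant to capture.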
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments. The description herein does not mean that all subject matter corresponding to the specific examples set forth in the embodiments is recited in the claims.
Referring to FIG. 1, a method for optimizing a document includes the following steps:
S101, capturing text in the document to obtain an original text;
specifically, crawler technology is used to capture the text in the document, which avoids having to obtain the text through manual search and improves the user's working efficiency; in practical applications, the text can also be obtained by purchase, and the text can be a sentence, a paragraph or a chapter.
S102, processing the original text to obtain a plurality of first target texts;
specifically, the original text is segmented into words using jieba, and some meaningless words can be removed by this processing. The procedure is as follows: first, word segmentation and part-of-speech tagging are performed, and words with the specified parts of speech are taken as candidate words; then the TF-IDF value of each candidate word is calculated, the candidate words are sorted in descending order of TF-IDF value, a specified number of words are output as likely keywords, and these keywords are taken as the first target texts.
Here, the term frequency (TF) is calculated as:
$\mathrm{TF} = N / M$, where $N$ represents the number of times the feature item (word) appears in the text and $M$ is the total number of words in the text;
the inverse document frequency (IDF) is calculated as:
$\mathrm{IDF} = \log(D / D_w)$, where $D$ denotes the total number of texts and $D_w$ denotes the number of texts in which the keyword appears.
S103, receiving a second target text selected by a user from the plurality of first target texts;
specifically, the user selects the corresponding word to be optimized, which is convenient for the user to operate.
S104, calculating the heat of each hot word in each first target text according to Newton's law of cooling;
specifically, the heat of each hot word in each target text is calculated with the formula:
$T'(t) = -k\,(T(t) - H)$, where $T'(t)$ represents the rate of temperature change, the negative sign represents cooling, $k$ represents the cooling coefficient with $k > 0$, $T(t)$ represents the temperature $T$ as a function of time $t$, and $H$ represents room temperature.
S105, calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
specifically, the remaining first target texts are preprocessed and then input into a word2vec model, which is trained to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text into the word2vec model for similarity calculation.
S106, displaying the corresponding recommended hot-word texts to the user according to the similarity.
Further, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.
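The preprocessing described above (filtering stop words, punctuation and emoticons) might look roughly like the following sketch. The stop-word list, the punctuation set and the emoji code-point ranges are all illustrative assumptions; the patent does not specify which lists or ranges are used, and `preprocess` is a hypothetical helper name.

```python
import re
import string
from typing import List

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"the", "a", "of", "to"}

# ASCII punctuation plus a few common full-width Chinese marks (assumed set).
PUNCTUATION = set(string.punctuation) | set("，。！？；：、（）")

# Rough emoji ranges; an assumption that covers common emoticons, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(tokens: List[str]) -> List[str]:
    """Drop stop words, punctuation-only tokens and emoticons from a token list."""
    cleaned = []
    for tok in tokens:
        tok = EMOJI_RE.sub("", tok).strip()  # strip emoticons inside the token
        if not tok:
            continue  # token was only emoticons or whitespace
        if tok.lower() in STOP_WORDS:
            continue
        if all(ch in PUNCTUATION for ch in tok):
            continue
        cleaned.append(tok)
    return cleaned

print(preprocess(["I", "love", "the", "sea", "!", "😀", "wow😀", "，"]))
```

The cleaned token list is what would then be passed on to the word-vector training step.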
With this scheme, the text in the document is segmented into words to obtain a plurality of first target texts, the hot-word heat of each first target text is calculated using Newton's law of cooling, the similarity between the remaining first target texts and the second target text is calculated with the word2vec model, the results are ranked by similarity, and the corresponding recommended hot-word texts are recommended to the user.
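The keyword-selection part of step S102 can be sketched as follows, using the TF and IDF definitions given in the description (TF is N/M, IDF is log of D/Dw). This is a minimal sketch under stated assumptions: the texts are already tokenized, the jieba segmentation and part-of-speech filtering are omitted, the scored text is itself one of the corpus texts (so Dw is never zero), and `tf_idf` and `top_keywords` are hypothetical helper names.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of `doc` with TF * IDF, where TF = N / M and IDF = log(D / Dw).
    Assumes `doc` is one of the texts in `corpus`, so Dw >= 1."""
    counts = Counter(doc)
    total_words = len(doc)    # M: number of words in the text
    total_docs = len(corpus)  # D: total number of texts
    scores = {}
    for word, n in counts.items():  # n is N: occurrences of the word
        docs_with_word = sum(1 for d in corpus if word in d)  # Dw
        scores[word] = (n / total_words) * math.log(total_docs / docs_with_word)
    return scores

def top_keywords(doc: List[str], corpus: List[List[str]], limit: int) -> List[str]:
    """Return the `limit` highest-scoring words, in descending TF-IDF order."""
    scores = tf_idf(doc, corpus)
    return sorted(scores, key=scores.get, reverse=True)[:limit]

# Made-up, pre-tokenized corpus for illustration only.
corpus = [["forget", "you", "you"], ["you", "learn"], ["sea", "you"]]
scores = tf_idf(corpus[0], corpus)
keywords = top_keywords(corpus[0], corpus, 1)
print(keywords)
```

A word that appears in every text, such as "you" here, gets an IDF of zero and therefore never ranks as a keyword, however frequent it is.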
For example, the original text 'i will learn to forget you' is segmented to obtain 'i will', 'learning to' and 'forget', which form the plurality of first target texts. The user then selects 'i will' as the text to be adjusted, so that 'i will' becomes the second target text. Corresponding recommended hot-word texts such as 'i', 'you', 'your' and 'want' are obtained by calculation, and the user can choose a hot word to substitute according to their own needs. The original text is thereby optimized, the user's workload is reduced, and the user's working efficiency is improved.
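To make the ranking in this example concrete, the sketch below orders candidate words by cosine similarity to the selected text. Cosine similarity is the measure commonly used with word2vec vectors, but the patent does not name the measure, so that choice is an assumption. The 3-dimensional vectors are made up to stand in for trained embeddings, and `cosine` and `recommend` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors standing in for trained word2vec embeddings (values are made up).
VECTORS: Dict[str, List[float]] = {
    "i will": [0.9, 0.1, 0.0],
    "i": [0.8, 0.2, 0.1],
    "want": [0.7, 0.1, 0.3],
    "you": [0.1, 0.9, 0.2],
}

def recommend(selected: str, vectors: Dict[str, List[float]], limit: int = 3) -> List[str]:
    """Return candidates in descending order of similarity to the selected text."""
    target = vectors[selected]
    candidates = [w for w in vectors if w != selected]
    candidates.sort(key=lambda w: cosine(target, vectors[w]), reverse=True)
    return candidates[:limit]

print(recommend("i will", VECTORS))
```

In a real system the vectors would come from the trained word2vec model, and the candidates would be the hot words whose heat was computed in step S104.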
Referring to FIG. 2, a system for optimizing a document includes an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the heat of each hot word in each first target text according to Newton's law of cooling, and the second calculating unit is used for calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
and the display unit is used for displaying the corresponding recommended hot-word texts to the user according to the similarity.
Specifically, the recommended hot words can be displayed in descending order, and the user can select a recommended hot word as a replacement according to their own needs, thereby optimizing the document.
Further, in order to calculate the similarity accurately and obtain recommended hot-word texts with high relevance, the second calculating unit is specifically used for:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation.
Further, the preprocessing specifically includes filtering stop words, filtering punctuation marks and filtering emoticons.
Finally, while the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (7)

1. A method for optimizing a document, comprising the steps of:
capturing text in the document to obtain an original text;
processing the original text to obtain a plurality of first target texts, namely, first performing word segmentation on the original text and taking words with specified parts of speech as candidate words; then calculating the TF-IDF value of each candidate word, sorting the candidate words in descending order of TF-IDF value, outputting a specified number of words as keywords, and taking the keywords as the first target texts;
receiving a second target text selected by a user from the plurality of first target texts, wherein the user selects the words that need to be optimized;
calculating the heat of each hot word in each first target text according to Newton's law of cooling;
calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model, which specifically comprises: preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation;
and displaying a corresponding hot word recommendation text to the user according to the similarity.
2. The method of claim 1, wherein the text in the document is captured by a crawler technique.
3. The method for optimizing a document according to claim 1, wherein a jieba mode is adopted to perform word segmentation on the original text.
4. The method of claim 1, wherein the heat of each hot word in each target text is calculated with the formula:
$T'(t) = -k\,(T(t) - H)$, where $T'(t)$ represents the rate of temperature change, the negative sign represents cooling, $k$ represents the cooling coefficient with $k > 0$, $T(t)$ represents the temperature $T$ as a function of time $t$, and $H$ represents room temperature.
5. The method of claim 4, wherein the preprocessing includes filtering stop words, filtering punctuation marks, and filtering emoticons.
6. A system for optimizing a document, characterized by comprising an extraction unit, a preprocessing unit, a receiving unit, a processing unit and a display unit;
the extraction unit is used for capturing text in the document to obtain an original text;
the preprocessing unit is used for processing the original text to obtain a plurality of first target texts, namely, first performing word segmentation on the original text and taking words with the specified parts of speech as candidate words; then calculating the TF-IDF value of each candidate word, sorting the candidate words in descending order of TF-IDF value, outputting a specified number of words as keywords, and taking the keywords as the first target texts;
the receiving unit is used for receiving a second target text selected by a user from the plurality of first target texts;
the processing unit comprises a first calculating unit and a second calculating unit, the first calculating unit is used for calculating the heat of each hot word in each first target text according to Newton's law of cooling, and the second calculating unit is used for calculating the similarity between the remaining first target texts and the second target text according to a preset word2vec model;
the display unit is used for displaying corresponding recommended hot-word texts to the user according to the similarity;
the second calculating unit is specifically used for:
preprocessing the remaining first target texts, inputting the preprocessed first target texts into a word2vec model, and training to obtain multi-dimensional word vectors;
extracting feature items of the multi-dimensional word vectors to obtain corresponding feature data;
and inputting the feature data and the second target text selected by the user into the word2vec model for similarity calculation.
7. The system of claim 6, wherein the preprocessing comprises filtering stop words, filtering punctuation marks, and filtering emoticons.
CN201710698292.4A 2017-08-15 2017-08-15 Method and system for optimizing file Active CN107688621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710698292.4A CN107688621B (en) 2017-08-15 2017-08-15 Method and system for optimizing file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710698292.4A CN107688621B (en) 2017-08-15 2017-08-15 Method and system for optimizing file

Publications (2)

Publication Number Publication Date
CN107688621A (en) 2018-02-13
CN107688621B (en) 2021-06-29

Family

ID=61153398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710698292.4A Active CN107688621B (en) 2017-08-15 2017-08-15 Method and system for optimizing file

Country Status (1)

Country Link
CN (1) CN107688621B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509417B (en) * 2018-03-20 2022-03-15 腾讯科技(深圳)有限公司 Title generation method and device, storage medium and server
CN111340551A (en) * 2020-02-27 2020-06-26 广东博智林机器人有限公司 Method, device, terminal and storage medium for generating advertisement content
CN112015975B (en) * 2020-07-15 2023-11-14 北京淇瑀信息科技有限公司 Information pushing method and device for financial users based on Newton's law of cooling
CN113572753B (en) * 2021-07-16 2023-03-14 北京淇瑀信息科技有限公司 User equipment authentication method and device based on Newton's cooling law

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8312022B2 (en) * 2008-03-21 2012-11-13 Ramp Holdings, Inc. Search engine optimization
CN103377232A (en) * 2012-04-25 2013-10-30 阿里巴巴集团控股有限公司 Headline keyword recommendation method and system
CN104636334A (en) * 2013-11-06 2015-05-20 阿里巴巴集团控股有限公司 Keyword recommending method and device
CN106649536A (en) * 2016-11-01 2017-05-10 四川用联信息技术有限公司 Achievement of optimization of search engine keywords based on improved k Means algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7693830B2 (en) * 2005-08-10 2010-04-06 Google Inc. Programmable search engine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
新词识别和热词排名方法研究;耿升华;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140315(第3期);正文第5.2小节 *

Also Published As

Publication number Publication date
CN107688621A (en) 2018-02-13

Similar Documents

Publication Publication Date Title
CN107688621B (en) Method and system for optimizing file
CN106874292B (en) Topic processing method and device
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
JP6394388B2 (en) Synonym relation determination device, synonym relation determination method, and program thereof
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN110147425B (en) Keyword extraction method and device, computer equipment and storage medium
CN106126619A (en) A kind of video retrieval method based on video content and system
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN108319734A (en) A kind of product feature structure tree method for auto constructing based on linear combiner
CN112559684A (en) Keyword extraction and information retrieval method
CN111309864B (en) User group emotional tendency migration dynamic analysis method for microblog hot topics
Heuer Text comparison using word vector representations and dimensionality reduction
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN112182145A (en) Text similarity determination method, device, equipment and storage medium
Hu et al. Text sentiment analysis: A review
CN113627797B (en) Method, device, computer equipment and storage medium for generating staff member portrait
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN114141384A (en) Method, apparatus and medium for retrieving medical data
CN114138969A (en) Text processing method and device
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list
Marusenko et al. Mathematical methods for attributing literary works when solving the “Corneille–Molière” problem
CN115130453A (en) Interactive information generation method and device
CN114417863A (en) Word weight generation model training method and device and word weight generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant