CN111159361A - Method and device for acquiring article and electronic equipment - Google Patents


Info

Publication number
CN111159361A
Authority
CN
China
Prior art keywords
article
word
keyword
keywords
articles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911422953.6A
Other languages
Chinese (zh)
Other versions
CN111159361B (en)
Inventor
徐磊
袁力
邸烁
胡坤歌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aershan Block Chain Alliance Technology Co ltd
Original Assignee
Beijing Aershan Block Chain Alliance Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aershan Block Chain Alliance Technology Co ltd
Priority to CN201911422953.6A
Publication of CN111159361A
Application granted
Publication of CN111159361B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention provides a method and a device for acquiring an article, and electronic equipment. The method comprises the following steps: acquiring a keyword of specified tonality; searching according to the keyword to obtain articles corresponding to the keyword; performing word segmentation on each article to obtain a word segmentation file of the article, the word segmentation file comprising a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keyword of specified tonality and calculating the similarity between the word sequence and the keyword; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The method automatically acquires articles of a user-specified tonality through crawler technology and is efficient, which improves the user experience.

Description

Method and device for acquiring article and electronic equipment
Technical Field
The invention relates to the technical field of crawlers, in particular to a method and a device for acquiring articles and electronic equipment.
Background
At present, internet articles are rich in type, novel in content, and huge in volume; new media websites keep emerging, and media content takes many different forms. Different users have different reading needs: each user prefers to read articles and media with a specific tonality. How to automatically push articles of a specific tonality to users has therefore become a main task of many media applications. Existing methods mainly obtain articles of a user-specific tonality with the Word2Vec tool. Although this approach can obtain such articles, it is inefficient, which results in a poor reading experience for the user.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for obtaining an article, and an electronic device, which automatically obtain an article with a user-specified tonality through a crawler technology, and have high efficiency, thereby improving user experience.
In a first aspect, an embodiment of the present invention provides a method for acquiring an article, which is applied to a server, and the method includes:
acquiring a keyword of specified tone;
searching according to the keywords to obtain articles corresponding to the keywords;
performing word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
comparing each word sequence in the word segmentation file with the keyword of specified tonality, and calculating the similarity between the word sequence and the keyword;
selecting a specified number of word sequences with the highest similarity as target keywords;
and continuing searching according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of comparing each word sequence in the segmentation file with the keyword with a specified tone, and calculating a similarity between the word sequence and the keyword includes:
inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
and respectively calculating the similarity between the word sequence and the keyword through the word vector and the keyword vector of the keyword with specified tone.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where after the word segmentation file is input to a pre-trained word training model, the method further includes:
outputting article vectors of articles corresponding to the word segmentation files through the word training model trained in advance;
and calculating the similarity between the searched article and the article stored in the article database according to the article vector so as to judge the repeatability of the searched article.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of performing a search according to the keyword to obtain an article corresponding to the keyword includes:
acquiring a designated website address input by a user;
and searching on the website corresponding to the specified website address according to the keyword to obtain an article corresponding to the keyword.
With reference to the third possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of searching for the article corresponding to the keyword on the website corresponding to the specified website address according to the keyword further includes:
inputting the keywords into a preset crawler program;
and searching on the website corresponding to the specified website address according to the keyword through the crawler program to obtain the article corresponding to the keyword.
In a second aspect, an embodiment of the present invention further provides an apparatus for acquiring an article, where the apparatus is applied to a server, and the apparatus includes:
the acquisition module is used for acquiring keywords of specified tone;
the first search module is used for searching according to the keywords to obtain articles corresponding to the keywords;
the processing module is used for carrying out word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
the calculation module is used for comparing each word sequence in the word segmentation file with the keywords with the appointed tone respectively and calculating the similarity between the word sequence and the keywords;
the selecting module is used for selecting the word sequences with the highest similarity in the specified number as target keywords;
and the second searching module is used for continuously searching according to the target keyword to obtain the articles corresponding to the target keyword until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the computing module further includes:
inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
and respectively calculating the similarity between the word sequence and the keyword through the word vector and the keyword vector of the keyword with specified tone.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present invention provides a second possible implementation manner of the second aspect, where after the word segmentation file is input to a pre-trained word training model, the apparatus further includes:
outputting article vectors of articles corresponding to the word segmentation files through the word training model trained in advance;
and calculating the similarity between the searched article and the article stored in the article database according to the article vector so as to judge the repeatability of the searched article.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for acquiring an article according to the first aspect when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method for acquiring an article according to the first aspect.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a method and a device for acquiring an article, and electronic equipment. The method comprises the following steps: acquiring a keyword of specified tonality; searching according to the keyword to obtain articles corresponding to the keyword; performing word segmentation on each article to obtain a word segmentation file comprising a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keyword of specified tonality and calculating the similarity between the word sequence and the keyword; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The method automatically acquires articles of a user-specified tonality through crawler technology and is efficient, which improves the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for acquiring an article according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for acquiring an article according to an embodiment of the present invention;
Fig. 3 is a flowchart of another method for acquiring an article according to an embodiment of the present invention;
Fig. 4 is a flowchart of another method for acquiring an article according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an apparatus for acquiring an article according to an embodiment of the present invention.
Reference numerals:
10-an acquisition module; 20-a first search module; 30-a processing module; 40-a calculation module; 50-a selection module; 60-a second search module.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, internet articles are rich in type, novel in content, and huge in volume; new media websites keep emerging, and media content takes many different forms. Different users therefore have different reading needs, and each user prefers to read articles and media of a particular tonality. For example, media content with positive energy is generally readily recognized and accepted by the public. How to obtain articles of a user-specified tonality through semantic analysis has thus become a main task of many media applications.
In the field of natural language processing, text similarity has long been a popular research direction. The main existing tool for measuring linguistic and semantic similarity is Word2Vec. The tool segments sentences, places all the words into a vocabulary, and uses a neural network to learn word vectors for the words in an unsupervised manner by means of the skip-gram model and negative sampling. This finally yields word vectors for the corresponding words and description vectors for article contents, and the similarity between two words or two articles can be judged by comparing the word vectors or the content vectors.
However, although the Word2Vec tool can obtain articles of a user-specific tonality, it is inefficient, which results in a poor reading experience for the user. In view of this technical problem, embodiments of the present invention provide a method and an apparatus for acquiring an article, and an electronic device, which automatically acquire articles of a user-specified tonality through crawler technology.
In practice, as web crawler technology has matured, high-performance crawlers have taken increasingly varied forms and can crawl all publicly accessible website content on the internet. The crawler in the embodiment of the invention is scheduled with a queue rather than through self-calling functions: self-calling readily leads to deep recursion, which occupies a large amount of memory and degrades system performance. Because the basic design of the crawler follows queue scheduling, the method is efficient, which improves the user experience.
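The queue-based scheduling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the link graph `LINKS`, the function `crawl`, and the `max_pages` limit are hypothetical, and page fetching is replaced by an in-memory dictionary so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real web pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first with an explicit queue instead of recursion."""
    queue = deque([start])   # pages waiting to be visited
    seen = {start}           # pages already scheduled, to avoid revisiting
    order = []               # pages in the order they were visited
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


visited = crawl("a", lambda u: LINKS.get(u, []))
```

Because pending pages live in the queue rather than on the call stack, memory use grows with the number of pending pages and not with the depth of the link chain.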
To facilitate understanding of the embodiment, a detailed description is first provided below of a method for acquiring an article according to an embodiment of the present invention.
The first embodiment is as follows:
the embodiment of the invention provides a method for acquiring an article, which is applied to a server. Fig. 1 is a flowchart of a method for acquiring an article according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step S102, obtaining keywords of appointed tone;
Tonality refers to the stylistic orientation and literary manner of an article, such as positive energy, an optimistic attitude, or sadness. The keywords are words that express the specified tonality of the article. For the specified tonality of positive energy, the keywords may be words such as efficient, effective, and learning; for the specified tonality of an optimistic attitude, the keywords may be words such as optimistic, active, and lively; for the specified tonality of sadness, the keywords may be words such as somber, dejected, and melancholy. In practical applications, the user may set the specified tonality of the article and the keywords of that tonality as desired, which is not limited in the embodiment of the present invention.
Step S104, searching according to the keywords to obtain articles corresponding to the keywords;
After the specified tonality and its keywords have been set according to the user's reading preference, a search is performed according to the keywords through crawler technology to obtain the articles corresponding to the keywords. For example, a website is searched automatically for the keyword "melancholy" through crawler technology, thereby obtaining the articles corresponding to that keyword, that is, articles with a sad tonality.
Step S106, performing word segmentation processing on the article to obtain word segmentation files of the article; the word segmentation file comprises a plurality of word sequences of the article;
Optionally, word segmentation splits the article content into individual words according to the rules of Chinese, yielding the word segmentation file of the article. For example, segmenting the sentence "I love Beijing" yields a word segmentation file comprising "I", "love", and "Beijing". In practical applications, the specific word segmentation method and rules may be set according to the actual situation, which is not limited in the embodiment of the present invention.
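As an illustration of step S106, the sketch below segments a Chinese sentence by forward maximum matching against a small dictionary. The invention does not prescribe a segmentation algorithm, so the technique, the function `segment`, the toy dictionary `VOCAB`, and the sentence 我爱北京 (taken here as the Chinese form of "I love Beijing") are all assumptions made for the example.

```python
def segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Toy dictionary; a real system would load a full lexicon.
VOCAB = {"我", "爱", "北京"}
tokens = segment("我爱北京", VOCAB)  # ["我", "爱", "北京"]
```

A production system would more likely rely on an established segmentation library; the dictionary approach is used here only because it is self-contained.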
Step S108, comparing each word sequence in the word segmentation file with the keywords with the appointed tone, and calculating the similarity between the word sequence and the keywords;
Specifically, after the word segmentation file is obtained, a word vector can be obtained for each word sequence. A word vector is a multidimensional real-valued vector that describes a word; the higher its dimensionality, the more finely it can represent the word's meaning. Word vectors are usually obtained by unsupervised learning based on a neural network or a regression algorithm. A keyword vector can be obtained in the same way from the keyword of specified tonality, so that the similarity between a word sequence and the keyword can be calculated from the word vector and the keyword vector. The similarity is usually expressed by the included angle between the word vector and the keyword vector: the smaller the angle, the higher the similarity, and the larger the angle, the lower the similarity.
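One common way to turn the included angle into a number is the cosine of the angle, which equals 1 for vectors pointing the same way and decreases as the angle grows. The sketch below assumes this measure; the function name `cosine_similarity` and the zero-vector convention are choices made for the example, not details taken from the invention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; treat it as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # angle of 90 degrees
```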
Step S110, selecting a word sequence with highest similarity and a specified number as a target keyword;
Specifically, the specified number of word sequences with the highest similarity replace the keywords of specified tonality and become the target keywords, which are then used to search again through the crawler technology.
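The selection in step S110 amounts to a top-N pick over the similarity scores. A minimal sketch, in which the function `select_target_keywords` and the sample scores are hypothetical:

```python
import heapq


def select_target_keywords(similarities, count):
    """Return the `count` words with the highest similarity, best first."""
    return heapq.nlargest(count, similarities, key=similarities.get)


# Hypothetical similarity of each word sequence to the keyword "optimistic".
scores = {"upbeat": 0.91, "table": 0.05, "cheerful": 0.88, "gloomy": -0.40}
targets = select_target_keywords(scores, 2)  # ["upbeat", "cheerful"]
```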
And step S112, continuing to search according to the target keywords to obtain the articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
At this point, the specified website is searched again through the crawler technology to obtain the articles corresponding to the target keywords. When the number of articles reaches the preset threshold, the searched articles are stored in the article database and displayed so that the user can read them. The method thus automatically acquires articles of the specified tonality through crawler technology and, because the basic design of the crawler follows queue scheduling, is efficient, which improves the user experience.
The method for acquiring an article provided by the embodiment of the invention comprises the following steps: acquiring a keyword of specified tonality; searching according to the keyword to obtain articles corresponding to the keyword; performing word segmentation on each article to obtain a word segmentation file comprising a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keyword of specified tonality and calculating the similarity between the word sequence and the keyword; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The method automatically acquires articles of a user-specified tonality through crawler technology and is efficient, which improves the user experience.
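The overall loop of steps S102 to S112 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the invention: `collect_articles`, the `max_rounds` safety limit, and the stub functions `fake_search` and `fake_next_keywords` (standing in for the crawler search and for the segmentation, similarity, and selection steps) are all hypothetical.

```python
def collect_articles(seed_keywords, search, next_keywords, threshold, max_rounds=10):
    """Search, derive new target keywords, and repeat until enough articles are found."""
    keywords = list(seed_keywords)
    collected = []
    for _ in range(max_rounds):
        found = []
        for kw in keywords:
            for article in search(kw):
                if article not in collected and article not in found:
                    found.append(article)
        if not found:
            break  # nothing new turned up, so stop instead of looping forever
        collected.extend(found)
        if len(collected) >= threshold:
            break  # the preset threshold has been reached
        # Replace the current keywords with the target keywords for the next round.
        keywords = next_keywords(found, keywords)
    return collected


# Stub data: which articles each keyword finds, and which keywords follow which.
INDEX = {
    "optimistic": ["article-1", "article-2"],
    "cheerful": ["article-2", "article-3"],
    "lively": ["article-4"],
}
NEXT = {"optimistic": ["cheerful"], "cheerful": ["lively"]}


def fake_search(keyword):
    return INDEX.get(keyword, [])


def fake_next_keywords(articles, keywords):
    return [n for kw in keywords for n in NEXT.get(kw, [])]


result = collect_articles(["optimistic"], fake_search, fake_next_keywords, threshold=3)
```

With a threshold of 3, the first round finds two articles for "optimistic", and the second round, using the target keyword "cheerful", adds the third.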
On the basis of fig. 1, another method for acquiring an article is provided in the embodiment of the present invention, and fig. 2 is a flowchart of another method for acquiring an article, which is provided in the embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
step S202, obtaining keywords of appointed tone;
step S204, searching according to the keywords to obtain articles corresponding to the keywords;
step S206, performing word segmentation processing on the article to obtain word segmentation files of the article; the word segmentation file comprises a plurality of word sequences of the article;
step S202 to step S206 may refer to step S102 to step S106, and the detailed description of the embodiments of the present invention is omitted here.
Step S208, inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
specifically, the word segmentation file is input into a word training model trained in advance, so that a word vector of each word sequence is obtained, and the similarity between the word sequence and the keyword is calculated through the word vector and the keyword vector.
In addition, after the word segmentation file is input into the pre-trained word training model, the embodiment of the invention further comprises the following steps: outputting article vectors of the articles corresponding to the word segmentation files through the pre-trained word training model; and calculating the similarity between a searched article and the articles stored in the article database according to the article vectors, so as to judge whether the searched article is a repeat. Like a word vector, an article vector is a multidimensional real-valued vector; it describes the content of an article.
Step S210, respectively calculating the similarity between the word sequence and the keywords through the word vectors and the keyword vectors of the keywords with specified tone;
specifically, according to the included angle between the word vector and the keyword vector, the similarity between the word sequence and the keyword can be obtained. Similarly, the similarity between the searched article and the article stored in the article database can be obtained through the included angle between the article vector and the vector of the article stored in the article database, so that the repeatability judgment can be performed on the searched article, the repeated article is prevented from being sent to the user to be read, and the reading experience of the user is improved.
Step S212, selecting word sequences with the highest similarity and the specified number as target keywords;
and step S214, continuing to search according to the target keywords to obtain the articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
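The repeatability check described for steps S208 to S210 can be sketched as follows. In the invention the article vector is output by the pre-trained model; averaging word vectors in `mean_vector` is a stand-in assumed here so that the example is self-contained, and the function names and the 0.95 threshold are likewise hypothetical.

```python
import math


def mean_vector(word_vectors):
    """Average a non-empty list of equal-length word vectors (assumed stand-in)."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine of the included angle between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_duplicate(article_vector, stored_vectors, threshold=0.95):
    """Treat an article as a repeat if it is very close to any stored article."""
    return any(cosine(article_vector, s) >= threshold for s in stored_vectors)


stored = [[1.0, 0.0]]                              # vectors already in the database
near_copy = is_duplicate([0.99, 0.01], stored)     # almost the same direction
different = is_duplicate([0.0, 1.0], stored)       # orthogonal content
```

Only articles for which `is_duplicate` returns `False` would then be added to the database and shown to the user.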
Further, on the basis of fig. 1, another method for acquiring an article is provided in the embodiment of the present invention, and fig. 3 is a flowchart of another method for acquiring an article, as shown in fig. 3, the method includes the following steps:
step S302, obtaining keywords of appointed tone;
step S304, acquiring a designated website address input by a user;
step S306, searching on the website corresponding to the specified website address according to the keywords to obtain an article corresponding to the keywords;
Specifically, the keyword is first input into a preset crawler program; the crawler program then searches the website corresponding to the specified website address according to the keyword to obtain the articles corresponding to the keyword. In practical applications, after the keyword of specified tonality is obtained, the user inputs the specified website address, and the crawler program automatically searches the corresponding website according to the keyword. The specified website address may cover all internet addresses or may be the address of a single website used for reading articles, which is not limited in the embodiment of the present invention.
Step S308, performing word segmentation processing on the article to obtain word segmentation files of the article; the word segmentation file comprises a plurality of word sequences of the article;
step S310, comparing each word sequence in the word segmentation file with the keywords with the appointed tone, and calculating the similarity between the word sequence and the keywords;
step S312, selecting the word sequences with the highest similarity and the specified number as target keywords;
and step S314, continuing to search according to the target keywords to obtain the articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
Step S308 to step S314 may refer to step S106 to step S112, and the detailed description of the embodiment of the present invention is omitted here.
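For step S306, a crawler program has to turn the specified website address and the keyword into a request. The sketch below only builds such a request URL. The `/search` path and the `q` parameter are hypothetical, since each real website defines its own search interface, and `build_search_url` is a name chosen for the example.

```python
from urllib.parse import urlencode, urljoin


def build_search_url(site_address, keyword, search_path="/search", param="q"):
    """Combine a site address and a keyword into a search URL (hypothetical layout)."""
    base = urljoin(site_address, search_path)
    # urlencode escapes spaces and non-ASCII characters in the keyword.
    return base + "?" + urlencode({param: keyword})


url = build_search_url("https://example.com/", "optimistic")
```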
An example is given here for ease of understanding. As shown in Fig. 4, a user chooses to read positive-energy articles. The user first sets positive-energy keywords, such as "efficient" and "learning", sets the address of a specified website, such as the address of a current mainstream article website, and configures the initial parameters of the crawler program. The crawler program then searches the website corresponding to the specified address according to the keywords, obtains articles related to the positive-energy keywords, and downloads the searched articles to a local database. Next, word segmentation is performed on the searched articles to obtain word segmentation files, each comprising a plurality of word sequences of the article. The word segmentation files are input into the pre-trained word training model to obtain a word vector for each word sequence and, at the same time, an article vector for the corresponding article. The similarity between each word sequence and the keywords, and the similarity between each searched article and the articles already stored in the article database, are then calculated, and the articles are stored. Finally, the specified number of word sequences with the highest similarity are selected as target keywords to replace the originally set positive-energy keywords, and the crawler program searches the website corresponding to the specified address again. This repeats until the number of articles reaches the preset threshold, and the searched articles are stored in the article database. The preset threshold may be 3 or any other value and may be set according to the actual situation, which is not limited in the embodiment of the present invention.
On the basis of the above embodiment, the embodiment of the present invention further provides an article acquisition apparatus, which is applied to a server. Fig. 5 is a schematic diagram of an apparatus for acquiring an article according to an embodiment of the present invention, as shown in fig. 5, the apparatus includes:
an obtaining module 10, configured to obtain a keyword for specifying a tone;
the first search module 20 is configured to perform a search according to the keyword to obtain an article corresponding to the keyword;
the processing module 30 is configured to perform word segmentation processing on the article to obtain a word segmentation file of the article; the word segmentation file comprises a plurality of word sequences of the article;
the calculation module 40 is used for comparing each word sequence in the word segmentation file with the keywords with the appointed tone respectively and calculating the similarity between the word sequence and the keywords;
a selecting module 50, configured to select a word sequence with a highest similarity and a specified number as a target keyword;
the second searching module 60 is configured to continue to search according to the target keyword to obtain the articles corresponding to the target keyword until the number of the articles reaches a preset threshold, and store the searched articles in an article database.
Further, the calculation module 40 is further configured for:
inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
and calculating the similarity between each word sequence and the keyword from the word vector and the keyword vector of the keyword of the specified tone.
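As a rough illustration of how such a similarity could be computed and then used to pick target keywords, the sketch below uses cosine similarity. The description does not name the similarity measure, so cosine similarity is an assumption here, as are the function names `cosine_similarity` and `rank_words`; the vectors would in practice come from the pre-trained word training model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_words(
    word_vectors: Dict[str, Sequence[float]],
    keyword_vectors: List[Sequence[float]],
    top_n: int,
) -> List[str]:
    """Return the top_n words by their best similarity to any keyword vector."""
    scored = {
        word: max(
            (cosine_similarity(vec, kv) for kv in keyword_vectors),
            default=0.0,  # no keyword vectors: every word scores 0.0
        )
        for word, vec in word_vectors.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```

Taking the maximum over the keyword vectors is one possible way to handle several keywords at once; averaging the keyword vectors first would be another.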
Further, after the word segmentation file is input into the pre-trained word training model, the apparatus is further configured for:
outputting an article vector of the article corresponding to the word segmentation file through the pre-trained word training model;
and calculating the similarity between the searched article and the articles stored in the article database according to the article vector, so as to judge whether the searched article is a duplicate.
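The duplicate check can be sketched as below. The description states only that article vectors are compared to judge repetition; the use of cosine similarity, the hypothetical cutoff of 0.95, and the names `is_duplicate` and `store_if_new` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_duplicate(
    article_vec: Sequence[float],
    stored_vecs: List[List[float]],
    cutoff: float = 0.95,  # hypothetical value, not taken from the description
) -> bool:
    """True if the article vector is close enough to any stored article vector."""
    return any(_cosine(article_vec, s) >= cutoff for s in stored_vecs)


def store_if_new(
    article_vec: Sequence[float],
    stored_vecs: List[List[float]],
    cutoff: float = 0.95,
) -> bool:
    """Append the vector to the store unless it duplicates a stored article."""
    if is_duplicate(article_vec, stored_vecs, cutoff):
        return False
    stored_vecs.append(list(article_vec))
    return True
```

A linear scan over the stored vectors is enough for a small article database; a larger one would likely need an approximate nearest-neighbour index instead.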
The apparatus for acquiring an article provided by the embodiment of the invention acquires a keyword of a specified tone; searches according to the keyword to obtain an article corresponding to the keyword; performs word segmentation processing on the article to obtain a word segmentation file of the article, the word segmentation file comprising a plurality of word sequences of the article; compares each word sequence in the word segmentation file with the keyword of the specified tone and calculates the similarity between the word sequence and the keyword; selects a specified number of word sequences with the highest similarity as target keywords; and continues searching according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, storing the searched articles in an article database. The apparatus thus uses crawler technology to automatically acquire articles of the tone specified by the user, which is efficient and improves the user experience.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the method for acquiring an article provided in the above embodiment are implemented.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for acquiring the article in the embodiment are executed.
The computer program product provided in the embodiment of the present invention includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, which is not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for obtaining an article, which is applied to a server, the method comprising:
acquiring a keyword of specified tone;
searching according to the keywords to obtain articles corresponding to the keywords;
performing word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
comparing each word sequence in the word segmentation file with the keyword of the specified tone respectively, and calculating the similarity between the word sequence and the keyword;
selecting a specified number of word sequences with the highest similarity as target keywords;
and continuing searching according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
2. The method of claim 1, wherein the step of comparing each word sequence in the word segmentation file with the keyword of the specified tone respectively and calculating the similarity between the word sequence and the keyword comprises:
inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
and respectively calculating the similarity between the word sequence and the keyword through the word vector and the keyword vector of the keyword with specified tone.
3. The method of claim 2, wherein after inputting the word segmentation file into the pre-trained word training model, the method further comprises:
outputting article vectors of articles corresponding to the word segmentation files through the word training model trained in advance;
and calculating the similarity between the searched article and the article stored in the article database according to the article vector so as to judge the repeatability of the searched article.
4. The method for obtaining articles according to claim 1, wherein the step of searching according to the keyword to obtain the article corresponding to the keyword comprises:
acquiring a designated website address input by a user;
and searching on the website corresponding to the specified website address according to the keyword to obtain an article corresponding to the keyword.
5. The method of claim 4, wherein the step of searching on the website corresponding to the specified website address according to the keyword to obtain the article corresponding to the keyword further comprises:
inputting the keywords into a preset crawler program;
and searching on the website corresponding to the specified website address according to the keyword through the crawler program to obtain the article corresponding to the keyword.
6. An apparatus for obtaining an article, applied to a server, the apparatus comprising:
the acquisition module is used for acquiring keywords of specified tone;
the first search module is used for searching according to the keywords to obtain articles corresponding to the keywords;
the processing module is used for carrying out word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
the calculation module is used for comparing each word sequence in the word segmentation file with the keywords of the specified tone respectively and calculating the similarity between the word sequence and the keywords;
the selecting module is used for selecting the word sequences with the highest similarity in the specified number as target keywords;
and the second searching module is used for continuously searching according to the target keyword to obtain the articles corresponding to the target keyword until the number of the articles reaches a preset threshold value, and storing the searched articles in an article database.
7. The apparatus for obtaining an article according to claim 6, wherein the calculation module is further used for:
inputting the word segmentation file into a pre-trained word training model to output a word vector of each word sequence;
and respectively calculating the similarity between the word sequence and the keyword through the word vector and the keyword vector of the keyword with specified tone.
8. The apparatus for obtaining an article according to claim 7, wherein after inputting the word segmentation file into the pre-trained word training model, the apparatus is further used for:
outputting article vectors of articles corresponding to the word segmentation files through the word training model trained in advance;
and calculating the similarity between the searched article and the article stored in the article database according to the article vector so as to judge the repeatability of the searched article.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of obtaining an article of any of claims 1-5 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program for performing the steps of the method of obtaining an article of any one of claims 1-5 when executed by a processor.
CN201911422953.6A 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment Active CN111159361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422953.6A CN111159361B (en) 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911422953.6A CN111159361B (en) 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment

Publications (2)

Publication Number Publication Date
CN111159361A true CN111159361A (en) 2020-05-15
CN111159361B CN111159361B (en) 2023-10-20

Family

ID=70560716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422953.6A Active CN111159361B (en) 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment

Country Status (1)

Country Link
CN (1) CN111159361B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651674A (en) * 2020-06-03 2020-09-11 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment
CN112765962A (en) * 2021-01-15 2021-05-07 上海微盟企业发展有限公司 Text error correction method, device and medium
CN112784042A (en) * 2021-01-12 2021-05-11 北京明略软件系统有限公司 Text similarity calculation method and system combining article structure and aggregated word vector
CN113176878A (en) * 2021-06-30 2021-07-27 深圳市维度数据科技股份有限公司 Automatic query method, device and equipment
CN115329051A (en) * 2022-10-17 2022-11-11 成都大学 Multi-view news information rapid retrieval method, system, storage medium and terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6647383B1 (en) * 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
CN101059806A (en) * 2007-06-06 2007-10-24 华东师范大学 Word sense based local file searching method
CN104516902A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Semantic information acquisition method and corresponding keyword extension method and search method
CN105095203A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN110019669A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of text searching method and device
CN111143516A (en) * 2019-12-30 2020-05-12 广州探途网络技术有限公司 Article search result display method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王卫国;徐炜民;: "基于潜在语义分析的个性化查询扩展模型" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651674A (en) * 2020-06-03 2020-09-11 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment
CN111651674B (en) * 2020-06-03 2023-08-25 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment
CN112784042A (en) * 2021-01-12 2021-05-11 北京明略软件系统有限公司 Text similarity calculation method and system combining article structure and aggregated word vector
CN112765962A (en) * 2021-01-15 2021-05-07 上海微盟企业发展有限公司 Text error correction method, device and medium
CN113176878A (en) * 2021-06-30 2021-07-27 深圳市维度数据科技股份有限公司 Automatic query method, device and equipment
CN115329051A (en) * 2022-10-17 2022-11-11 成都大学 Multi-view news information rapid retrieval method, system, storage medium and terminal
CN115329051B (en) * 2022-10-17 2022-12-20 成都大学 Multi-view news information rapid retrieval method, system, storage medium and terminal

Also Published As

Publication number Publication date
CN111159361B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant