CN113032436A - Searching method and device based on article content and title - Google Patents

Searching method and device based on article content and title

Info

Publication number
CN113032436A
Authority
CN
China
Prior art keywords
article
matched
articles
search
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110412837.7A
Other languages
Chinese (zh)
Other versions
CN113032436B (en)
Inventor
姚鑫
白杰
白会杰
宋东瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zhenxuan Data Information Technology Co ltd
Original Assignee
Suzhou Zhenxuan Data Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zhenxuan Data Information Technology Co ltd filed Critical Suzhou Zhenxuan Data Information Technology Co ltd
Priority to CN202110412837.7A priority Critical patent/CN113032436B/en
Publication of CN113032436A publication Critical patent/CN113032436A/en
Application granted granted Critical
Publication of CN113032436B publication Critical patent/CN113032436B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a searching method and device based on article content and titles. The method comprises the following steps: storing article data in a distributed storage mode by using a search system, wherein the search system is implemented with ElasticSearch; when an article search request from a user terminal is received, retrieving article contents and titles in real time in the search system; and aggregating the data matched during retrieval with articles as the objects, and highlighting the matched articles on the user terminal in a paged display, so as to solve the problem in the prior art that retrieved data cannot be aggregated.

Description

Searching method and device based on article content and title
Technical Field
The invention relates to the field of big data search, in particular to a search method and a search device based on article contents and titles.
Background
With the progress of computer technology, big data search has developed rapidly. Current big data search is implemented with technologies such as MySQL, Solr, ElasticSearch and Hermes. Solr and ElasticSearch focus on search and full-text retrieval, at data scales in the millions and above. Solr uses ZooKeeper for distributed management, supports data in various formats, and is widely used in traditional search applications, but is inefficient in real-time search applications. ElasticSearch supports data in JSON format, has a distributed coordination management function, and is more efficient than Solr in real-time search applications. Hermes is a real-time retrieval and analysis platform for massive data based on a large-index technology; it emphasizes data analysis, and its data scale ranges from hundreds of millions to trillions.
Among the above schemes, MySQL full-text retrieval is inefficient, the relevance of its matched results is low, it cannot perform word-segmentation retrieval, and the resulting product experience is poor; Solr is inefficient in real-time applications; Hermes focuses on data analysis and is relatively inefficient in search and full-text retrieval; and ElasticSearch searches at the level of individual data records, so that multiple records matched by a search cannot be aggregated into the same item in an object-oriented manner during aggregation.
Disclosure of Invention
The invention mainly aims to provide a searching method and a searching device based on article contents and titles, which are used for solving the problem that the retrieved data cannot be aggregated in the prior art.
In order to achieve the above object, according to an aspect of the present invention, there is provided a search method based on article contents and titles, including: storing article data in a distributed storage mode by using a search system, wherein the search system is implemented with ElasticSearch; when an article search request from a user terminal is received, retrieving article contents and titles in real time in the search system; and aggregating the data matched during retrieval with articles as the objects, and highlighting the matched articles on the user terminal in a paged display.
Optionally, storing article data in a distributed storage mode by using a search system includes: splitting a whole article by title and paragraph and then storing it in the distributed storage system in a standard data structure, wherein the standard data structure comprises the following fields: the identifier of the article to which the data belongs, the type of that article, the source of that article, the name and content of that article, the URL of that article, whether the data is the title of the article, the release time of that article, the data generation time and the storage time.
Optionally, retrieving article contents and titles in real time in the search system when an article search request is received includes: retrieving article contents and titles in real time in the search system according to the keywords in the article search request to obtain a search result, wherein the search result comprises a single article or a plurality of articles. The single article covers the following cases: a single article matching only the title, a single article matching only paragraphs, and a single article matching both the title and paragraphs. The plurality of articles covers the following cases: all matched data are titles of different articles; some articles match both the title and paragraphs; some articles match only the title; and other articles (distinct from those just described, being some or all of the remaining articles) match only paragraphs.
Optionally, aggregating the data matched during the retrieval with the article as an object includes: for the data matched with the title, aggregating the data with the same identification of the article; and for the data only matched with the paragraph, aggregating the data with the same identification of the article to which the data belongs.
Optionally, highlighting the matched article on the user terminal by using a paging display mode includes: calculating the relevance between the matched articles and keywords in the article search request by adopting a relevance algorithm; and highlighting the matched articles on the user terminal according to the relevance from large to small by adopting a paging display mode.
Optionally, highlighting the matched article from large to small according to the relevance on the user terminal includes: highlighting the matched articles from large to small according to the relevance according to a preset display configuration on the user terminal, wherein the preset display configuration is that only the title is displayed or the title and the paragraph are displayed at the same time.
Optionally, after highlighting the matched article in a paging display manner, in the case of receiving a search request from the user terminal, if a keyword in the received search request is the same as a keyword in the article search request, returning the same search result as the previous search result to the user terminal.
In order to achieve the above object, according to an aspect of the present invention, there is also provided a search apparatus based on article contents and titles, including: a storage unit, configured to store article data in a distributed storage mode by using a search system, wherein the search system is implemented with ElasticSearch; a search unit, configured to retrieve article contents and titles in real time in the search system when an article search request from a user terminal is received; and a display unit, configured to aggregate the data matched during retrieval with articles as the objects and to highlight the matched articles on the user terminal in a paged display.
Optionally, the storage unit is further configured to: split a whole article by title and paragraph and then store it in the distributed storage system in a standard data structure, wherein the standard data structure comprises the following fields: the identifier of the article to which the data belongs, the type of that article, the source of that article, the name and content of that article, the URL of that article, whether the data is the title of the article, the release time of that article, the data generation time and the storage time.
Optionally, the search unit is further configured to: retrieve article contents and titles in real time in the search system according to the keywords in the article search request to obtain a search result, wherein the search result comprises a single article or a plurality of articles. The single article covers the following cases: a single article matching only the title, a single article matching only paragraphs, and a single article matching both the title and paragraphs. The plurality of articles covers the following cases: all matched data are titles of different articles; some articles match both the title and paragraphs; some articles match only the title; and some articles match only paragraphs.
Optionally, the display unit is further configured to: for the data matched with the title, aggregating the data with the same identification of the article; and for the data only matched with the paragraph, aggregating the data with the same identification of the article to which the data belongs.
Optionally, the display unit is further configured to: calculating the relevance between the matched articles and keywords in the article search request by adopting a relevance algorithm; and highlighting the matched articles on the user terminal according to the relevance from large to small by adopting a paging display mode.
Optionally, the display unit is further configured to: highlighting the matched articles from large to small according to the relevance according to a preset display configuration on the user terminal, wherein the preset display configuration is that only the title is displayed or the title and the paragraph are displayed at the same time.
Optionally, the apparatus of the present application may further comprise: a response unit, configured to, after the matched articles have been highlighted in a paged display and a further search request is received from the user terminal, return the same search result as the previous one to the user terminal if the keyword in the received search request is the same as the keyword in the earlier article search request.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program which, when executed, performs the above-described method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above method through the computer program.
By applying the technical scheme of the invention, distributed storage of millions of records is accomplished with ElasticSearch; ElasticSearch is also used to retrieve article contents and titles in real time, the matched data is aggregated with articles as the objects, a certain number of articles are displayed per page, and the matched data is highlighted, so that the problem in the prior art that retrieved data cannot be aggregated is solved and the user experience is improved.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it. In the drawings:
FIG. 1 illustrates a flow chart of an alternative article content and title based search method in accordance with the present invention;
FIG. 2 is a schematic diagram illustrating search results of an optional article according to the present invention;
FIG. 3 is a schematic diagram of an alternative correlation calculation scheme in accordance with the present invention;
FIG. 4 is a schematic diagram of an alternative correlation calculation scheme in accordance with the present invention;
FIG. 5 is a schematic diagram of an alternative correlation calculation scheme in accordance with the present invention;
FIG. 6 is a schematic diagram illustrating an alternative correlation calculation result according to the present invention;
FIG. 7 is a schematic diagram of an alternative data node in accordance with the present invention;
FIG. 8 is a schematic diagram illustrating an alternative data retrieval in accordance with the present invention; and
FIG. 9 shows a schematic diagram of an alternative article search scheme in accordance with the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances for describing embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
According to an aspect of embodiments of the present application, there is provided an embodiment of a search method based on article content and titles. As shown in fig. 1:
Step S101, storing article data in a distributed storage mode by using a search system, wherein the search system is implemented with ElasticSearch.
Optionally, storing article data in a distributed storage mode by using a search system includes: splitting a whole article by title and paragraph and then storing it in the distributed storage system in a standard data structure, wherein the standard data structure comprises the following fields: the identifier of the article to which the data belongs, the type of that article, the source of that article, the name and content of that article, the URL of that article, whether the data is the title of the article, the release time of that article, the data generation time and the storage time.
Step S102, when an article search request from the user terminal is received, retrieving article contents and titles in real time in the search system.
Optionally, retrieving article contents and titles in real time in the search system when an article search request is received includes: retrieving article contents and titles in real time in the search system according to the keywords in the article search request to obtain search results, wherein the search results comprise a single article or a plurality of articles. The single article covers the following cases: a single article matching only the title, a single article matching only paragraphs, and a single article matching both the title and paragraphs. The plurality of articles covers the following cases: all matched data are titles of different articles; some articles match both the title and paragraphs; some articles match only the title; and some articles match only paragraphs.
Step S103, aggregating the data matched during retrieval with articles as the objects, and highlighting the matched articles on the user terminal in a paged display.
Optionally, aggregating the data matched during the retrieval with the article as an object includes: for the data matched with the title, aggregating the data with the same identification of the article; and for the data only matched with the paragraph, aggregating the data with the same identification of the article to which the data belongs.
Optionally, highlighting the matched article on the user terminal by using a paging display mode includes: calculating the relevance between the matched articles and keywords in the article search request by adopting a relevance algorithm; and highlighting the matched articles on the user terminal according to the relevance from large to small by adopting a paging display mode.
Optionally, highlighting the matched article from large to small according to the relevance on the user terminal includes: highlighting the matched articles from large to small according to the relevance according to a preset display configuration on the user terminal, wherein the preset display configuration is that only the title is displayed or the title and the paragraph are displayed at the same time.
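The ranking, paging and highlighting described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `score` key, the function names and the `<em>` wrapper tags are hypothetical and are not taken from the original publication, and in a real deployment the search system itself would normally compute the relevance and the highlight fragments.

```python
import re
from typing import Dict, List


def highlight(text: str, keyword: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap every case-insensitive occurrence of keyword in pre/post tags."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


def page_by_relevance(articles: List[Dict], page_no: int, page_size: int) -> List[Dict]:
    """Sort articles by descending relevance and return one page (page_no starts at 1)."""
    ranked = sorted(articles, key=lambda a: a["score"], reverse=True)
    start = (page_no - 1) * page_size
    return ranked[start:start + page_size]


# Hypothetical aggregated articles with an assumed "score" key.
articles = [
    {"title": "Intro to search", "score": 1.0},
    {"title": "Search at scale", "score": 3.0},
    {"title": "Full-text search", "score": 2.0},
]
first_page = page_by_relevance(articles, page_no=1, page_size=2)
highlighted_titles = [highlight(a["title"], "search") for a in first_page]
```

A display configuration of "title only" versus "title and paragraphs" would then simply decide which fields of each article are passed through `highlight`.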
Optionally, after highlighting the matched article in a paging display manner, in the case of receiving a search request from the user terminal, if a keyword in the received search request is the same as a keyword in the article search request, returning the same search result as the previous search result to the user terminal.
Among the known use cases of ElasticSearch: ElasticSearch can be used for PB-level search; a core search architecture built on ElasticSearch can provide timely and accurate music search services to users; ElasticSearch can be used for text data analysis, collecting various metric data and user-defined data on servers and performing multi-dimensional analysis and display of the data to help locate instance-level or service-level anomalies; ES (ElasticSearch) can analyze and process hundreds of millions of real-time logs; and ES can be used to build a log collection and analysis system. In conclusion, considering the product's needs for real-time search and real-time data analysis, this scheme uses ElasticSearch to perform full-text retrieval based on article contents and titles.
In the scheme, distributed storage of millions of records is accomplished with ElasticSearch; ElasticSearch is also used to retrieve article contents and titles in real time, the matched data is aggregated with articles as the objects, a certain number of articles are displayed per page, and the matched data is highlighted, which improves the user experience. The technical solution of the present application is further detailed below with reference to specific embodiments:
Step 1, creating a data structure, namely splitting a whole article by title and paragraph, wherein the format of the data structure is shown in Table 1 below.
TABLE 1
[Table 1 is provided as an image in the original publication.]
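The splitting described in step 1 can be sketched in Python as follows. This is a minimal sketch, not the data structure of Table 1: the field names are hypothetical stand-ins for the fields listed in the description (article identifier, type, source, name and content, URL, title flag, release time, generation time, storage time), and splitting paragraphs on line breaks is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One stored record: either the title or one paragraph of an article."""
    paper_id: str           # identifier of the article the record belongs to
    paper_type: str         # type of that article
    source: str             # source of that article
    text: str               # article name (title record) or paragraph content
    url: str                # URL of that article
    is_title: bool          # whether this record is the article title
    publish_time: datetime  # release time of that article
    created_time: datetime  # data generation time
    stored_time: datetime   # storage time


def split_article(paper_id: str, paper_type: str, source: str, url: str,
                  title: str, body: str, publish_time: datetime,
                  now: Optional[datetime] = None) -> List[ArticleRecord]:
    """Split a whole article into one title record plus one record per paragraph."""
    now = now or datetime.now()
    common = dict(paper_id=paper_id, paper_type=paper_type, source=source, url=url,
                  publish_time=publish_time, created_time=now, stored_time=now)
    records = [ArticleRecord(text=title, is_title=True, **common)]
    for paragraph in body.split("\n"):  # assumption: paragraphs are separated by line breaks
        paragraph = paragraph.strip()
        if paragraph:
            records.append(ArticleRecord(text=paragraph, is_title=False, **common))
    return records


records = split_article("p1", "news", "example-source", "http://example.com/a",
                        "A title", "First paragraph.\n\nSecond paragraph.",
                        datetime(2021, 4, 16))
```

Because every record carries the identifier of its article, the records can later be regrouped into whole articles regardless of which shard stored them.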
Step 2, analyzing the user behavior.
In the search behavior analysis, the following cases exist for a single article, as shown in FIG. 2: the search result matches only the title, the search result matches only paragraphs, or the search result matches both paragraphs and the title.
The following cases exist for multiple articles: all matched data are titles of different articles; some articles match both the title and paragraphs; some articles match only the title; or some articles match only paragraphs.
The search filter field is sortText (i.e., the search keyword). The display covers the following cases: only titles are shown, or both titles and paragraphs are shown.
Because titles and paragraphs are processed differently as matching results, the search results must undergo the istile judgment (i.e., whether a matched record is a title); records of the same article are aggregated, and the article data structure is as follows:
[The article data structure is provided as an image in the original publication.]
The structure of the final aggregated search results is as follows:
[The aggregated result structure is provided as an image in the original publication.]
The implementation process of article aggregation:
The data returned by the ES is acquired, and articles and paragraphs are aggregated by the paperId in the data structure (records with the same paperId belong to the same article).
A. Titles are matched first: the data that matched a title is processed first, and all data with the same paperId is aggregated.
The article title, URL, source name and release time of every article are traversed, and the article paragraphs are bound to the article titles (by the article Id of the title) to obtain a set of article paragraphs; for a paragraph in the set, if the article Id of the title is the same as the article Id of the paragraph, the paragraph can be judged to come from the same article.
B. Paragraphs are matched next: since a search result may match only paragraphs, the data that matched only paragraphs is aggregated again by paperId.
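The two aggregation passes A and B can be sketched in Python as below. This is a minimal sketch under assumptions: only `paperId` appears in the description; the `isTitle` and `text` keys and the shape of the returned entries are hypothetical, since the original structures are only available as images.

```python
from typing import Dict, List


def aggregate_hits(hits: List[Dict]) -> List[Dict]:
    """Group matched records into articles by paperId: titles first, then paragraphs."""
    articles: Dict[str, Dict] = {}

    # Pass A: records that matched a title create the article entries.
    for hit in hits:
        if hit["isTitle"]:
            articles.setdefault(
                hit["paperId"],
                {"paperId": hit["paperId"], "title": hit["text"], "paragraphs": []},
            )

    # Pass B: paragraph records are bound to the article with the same paperId;
    # an article that matched only paragraphs gets an entry without a matched title.
    for hit in hits:
        if not hit["isTitle"]:
            entry = articles.setdefault(
                hit["paperId"],
                {"paperId": hit["paperId"], "title": None, "paragraphs": []},
            )
            entry["paragraphs"].append(hit["text"])

    return list(articles.values())


hits = [
    {"paperId": "p1", "isTitle": True, "text": "T1"},
    {"paperId": "p2", "isTitle": False, "text": "b"},
    {"paperId": "p1", "isTitle": False, "text": "a"},
    {"paperId": "p2", "isTitle": False, "text": "c"},
]
aggregated = aggregate_hits(hits)
```

In this sketch an entry with `title` set to `None` marks an article that matched only paragraphs; its title would have to be looked up separately before display.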
Step 3, performing word segmentation analysis on the search text. The word segmentation mechanism used is shown in Table 2.
TABLE 2
Component | Function | Example
Character Filter | Processes the original text | Removing HTML tags, special characters, etc.
Tokenizer | Segments the original text into words | Medical information -> Medicine, information
Token Filters | Processes the keywords after word segmentation | Converting to lower case, deleting mood words, handling synonyms, etc.
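The three stages of Table 2 can be illustrated with a small Python pipeline. This is only a sketch of the idea: the stop-word list, the whitespace tokenizer and the function names are assumptions made for illustration, and a real deployment would rely on the analyzer configured in the search system (including a tokenizer suited to Chinese text).

```python
import re
from typing import List

# Hypothetical stop words standing in for the "mood words" removed by the token filters.
STOP_WORDS = {"the", "a", "an", "of"}


def char_filter(text: str) -> str:
    """Character filter: strip HTML tags, then replace special characters with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def tokenize(text: str) -> List[str]:
    """Tokenizer: split the filtered text into words (whitespace split as a stand-in)."""
    return text.split()


def token_filters(tokens: List[str]) -> List[str]:
    """Token filters: lower-case every token and drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text: str) -> List[str]:
    """Run the full chain: character filter -> tokenizer -> token filters."""
    return token_filters(tokenize(char_filter(text)))


terms = analyze("<p>The Medical Information!</p>")
```

The order matters: the character filter runs on the raw text before tokenization, while the token filters only ever see individual tokens.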
Step 4, calculating the relevance by using a relevance algorithm, as shown in FIG. 3.
In step 401, the phrase Alfred way to be queried is determined.
At step 402, the number of times TF (term frequency) that the keyword appears in each document Doc is determined.
Step 403, determining the IDF (inverse document frequency), which reflects how often the keyword appears across the whole index.
Step 404, determining a field length norm, wherein the longer the field length, the smaller the value.
In step 405, the final scoring result score (q, d) for a document doc is determined.
Step 406, queryNorm(q) is used to normalize scores that would otherwise look scattered into a comparable range, without affecting their relative relationship, so that they are easier to read.
Step 407, coord(q, d) is used to reward matches: the more query terms a document doc matches, the higher its score; matching is calculated from the inverted index (for example, the inverted index of the username field).
Step 408, the scores are summed with the summation Σ to obtain the total weight, for the document doc, of every term in the query.
Step 409, tf(t in d) is used to determine the term frequency, which is the square root of the number of times the term appears in the document doc.
Step 410, t.getBoost() gives the weight (boost) value that has been set for the term.
Step 411, norm(t, d) is the field-length norm: the longer the field, the smaller the result.
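For reference, the factors named in steps 405 to 411 correspond to the practical scoring function of Lucene's classic TF/IDF similarity, which can be written as below. The squared idf(t) factor follows that classic formula rather than the step list itself, so this should be read as a sketch of how the pieces fit together and not as a formula quoted from the original publication.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{queryNorm}(q)\,\cdot\,\mathrm{coord}(q,d)\,\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t \text{ in } d)\,\cdot\,\mathrm{idf}(t)^{2}\,\cdot\,
t.\mathrm{getBoost}()\,\cdot\,\mathrm{norm}(t,d)\Big)
```

Here q is the query, d is the document doc, and the sum runs over every term t of the query.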
The scheme can be implemented with a TF/IDF model, as shown in FIG. 4, or with a BM25 model, as shown in FIG. 5; a comparison of the scores of the TF/IDF model and the BM25 model is shown in FIG. 6.
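One practical difference between the two models is how the score grows with term frequency, which the following Python sketch illustrates. It assumes the commonly documented forms of both models (square-root term frequency for classic TF/IDF, and the saturating BM25 term with parameters k1 = 1.2 and b = 0.75); the function names are hypothetical and the figures of the original publication are not reproduced here.

```python
import math


def classic_tf(freq: float) -> float:
    """Classic TF/IDF term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def bm25_tf(freq: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency factor: saturates towards k1 + 1 as freq grows."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1) / (freq + k1 * length_norm)


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """BM25 inverse document frequency: rarer terms receive a larger weight."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Compare how both factors react to a growing term frequency
# in a document of average length.
comparison = [(f, classic_tf(f), bm25_tf(f, 100, 100)) for f in (1, 10, 100, 1000)]
```

With these forms, the classic factor keeps growing with the raw frequency, whereas the BM25 factor never exceeds k1 + 1, so a term repeated very many times cannot dominate the score.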
Step 5, distributed storage. To avoid the risk of data loss caused by a single point of failure, the scheme adopts multi-node distributed storage, and the high availability of the system is improved through a master-slave design.
The working scheme for operating on data nodes is shown in FIG. 7:
1) The client sends a create, index or delete request to NODE 1 (i.e., the MASTER NODE) in the cluster.
2) The node uses the id of the document to determine that the document belongs to shard 0, and forwards the request to NODE 3, where the primary copy of shard 0 is located.
3) NODE 3 executes the request on the primary shard; if it succeeds, NODE 3 forwards the request to the corresponding replica shards located on NODE 1 and NODE 2. When all replica shards report success, NODE 3 reports success to the requesting node, which in turn reports to the client.
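The write path above can be sketched in Python as follows. This is a simplified illustration under assumptions: routing is shown as a hash of the document id modulo the number of primary shards, with `zlib.crc32` used only as a deterministic stand-in for the hash function of the real system, and shards are modelled as plain dictionaries; the function names are hypothetical.

```python
import zlib
from typing import Dict, List


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick the shard for a document from its id (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(doc_id: str, doc: Dict, primary: Dict[str, Dict],
                   replicas: List[Dict[str, Dict]]) -> bool:
    """Write to the primary shard first, then to every replica shard.

    Success is reported only when all replica shards hold the document,
    mirroring step 3) of the description.
    """
    primary[doc_id] = doc
    for replica in replicas:
        replica[doc_id] = doc
    return all(doc_id in replica for replica in replicas)


# Hypothetical cluster state: one primary copy and two replica copies of a shard.
primary_shard: Dict[str, Dict] = {}
replica_shards: List[Dict[str, Dict]] = [{}, {}]
shard_no = route_to_shard("doc-42", 3)
ok = index_document("doc-42", {"text": "hello"}, primary_shard, replica_shards)
```

Because the shard is derived from the document id alone, any node that receives a request can compute where the document lives and forward the request accordingly.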
The retrieval scheme is shown in FIG. 8:
1) The client sends a get request to NODE 1.
2) The node uses the id of the document to determine that the document belongs to shard 0; copies of shard 0 exist on all three nodes. In this example, the node forwards the request to NODE 2.
3) NODE 2 returns the document to NODE 1, which then returns it to the client.
For read requests, in order to balance the load, the requesting node selects a different shard copy for each request, cycling through all shard copies in turn. It may happen that a document being indexed already exists on the primary shard but has not yet been synchronized to the replica shards. In that case, a replica shard may report that the document is not found, while the primary shard returns the document successfully. Once the index request has returned success to the user, the document is available on both the primary shard and the replica shards.
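The load balancing of read requests can be sketched as a simple round-robin over the shard copies. This is a minimal sketch; the class name and the node labels are illustrative assumptions, not part of the original publication.

```python
from itertools import cycle
from typing import Iterable


class ShardCopySelector:
    """Hands out the copies of one shard in round-robin order."""

    def __init__(self, copies: Iterable[str]) -> None:
        self._copies = cycle(list(copies))

    def next_copy(self) -> str:
        """Return the copy that should serve the next read request."""
        return next(self._copies)


# Copies of shard 0 live on all three nodes in the example of FIG. 8.
selector = ShardCopySelector(["NODE 1", "NODE 2", "NODE 3"])
served_by = [selector.next_copy() for _ in range(4)]
```

Each call to `next_copy` advances the cycle, so consecutive read requests are spread evenly over the primary and replica copies.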
Optionally, the id with which the user registered is used as a unique identifier, the data requested by the user for the first time is associated with the user through the redis caching technology, and when the user initiates the same request again, the data is obtained from the cache, which further increases the response speed.
As shown in FIG. 9, after the user registers, the user id is bound to the uri (short for Uniform Resource Identifier); if all parameters of the accessed uri other than pageNo and pageSize are the same, the cached data is displayed to the user; otherwise, ElasticSearch is accessed again.
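The cache lookup can be sketched in Python as below. This is an illustration under assumptions: an in-memory dictionary stands in for the redis cache, the key format and the class and function names are hypothetical, and the cached value is assumed to be the full aggregated result from which individual pages are taken.

```python
from typing import Callable, Dict, List
from urllib.parse import parse_qsl, urlencode, urlsplit

# Paging parameters are ignored when deciding whether two requests are "the same".
IGNORED_PARAMS = {"pageNo", "pageSize"}


def cache_key(user_id: str, uri: str) -> str:
    """Build a key from the user id and the uri without its paging parameters."""
    parts = urlsplit(uri)
    params = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in IGNORED_PARAMS
    )
    return f"{user_id}:{parts.path}?{urlencode(params)}"


class SearchCache:
    """Serve repeated requests from the cache; otherwise query the search system."""

    def __init__(self, backend_search: Callable[[str], List[Dict]]) -> None:
        self._store: Dict[str, List[Dict]] = {}  # stand-in for the redis cache
        self._backend_search = backend_search

    def get(self, user_id: str, uri: str) -> List[Dict]:
        key = cache_key(user_id, uri)
        if key not in self._store:
            self._store[key] = self._backend_search(uri)
        return self._store[key]


backend_calls: List[str] = []


def fake_backend(uri: str) -> List[Dict]:
    """Hypothetical backend that records each call instead of querying ElasticSearch."""
    backend_calls.append(uri)
    return [{"paperId": "p1"}]


cache = SearchCache(fake_backend)
cache.get("u1", "/search?sortText=es&pageNo=1&pageSize=10")
cache.get("u1", "/search?pageNo=2&sortText=es&pageSize=10")
cache.get("u1", "/search?sortText=solr&pageNo=1&pageSize=10")
```

Sorting the remaining parameters makes the key independent of parameter order, so only a change of user or of a non-paging parameter causes a new query.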
In the technical scheme of the application, a cluster scheme is adopted, and the high availability of the system is improved through a master-slave mode and a master-slave mode; the method adopts the ElasticSearch inverted index principle, divides the search condition into words as much as possible, improves the matching correlation degree, displays the words to the user in a highlight form and improves the experience of the user; and highly aggregating the matched contents into an article by the thought facing the article object, and displaying the article to the user.
Because the service applies its own ordering, the default relevance algorithm of ES is bypassed and a filter is used in the query in place of match. A filter does not order results by score, so its performance is high, and ES can cache a filter after the same operation has been performed several times.
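One possible reading of this is to place the text clause in the `filter` context of a `bool` query, so that no score is computed, and to sort by a business field instead. The sketch below only builds the request body as a dictionary; the field names `content` and `publish_time` and the function name are assumptions, not taken from the application.

```python
def build_filter_query(keyword, page_no=1, page_size=10):
    """Build a search body whose clause runs in filter context (no scoring)."""
    if page_no < 1 or page_size < 1:
        raise ValueError("page_no and page_size must be positive")
    return {
        "query": {
            "bool": {
                # Clauses under "filter" restrict the result set without scoring.
                "filter": [{"match": {"content": keyword}}]
            }
        },
        # Business-defined ordering instead of ordering by relevance score.
        "sort": [{"publish_time": {"order": "desc"}}],
        # Offset-based paging: skip the hits of all earlier pages.
        "from": (page_no - 1) * page_size,
        "size": page_size,
    }


body = build_filter_query("elasticsearch", page_no=3, page_size=10)
```

The resulting dictionary could be sent as the body of a search request; how it is sent depends on the client library in use.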
The scheme uses ElasticSearch for distributed storage of data at the scale of millions of records. ElasticSearch is also used for real-time retrieval of article contents and titles; the matched data is aggregated with articles as objects, a set number of articles is displayed per page, and the matched data is highlighted, which improves the user experience.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
According to another aspect of the embodiments of the application, a searching device based on article content and titles is also provided. The device comprises: a storage unit, used for storing article data in a distributed storage mode by means of a search system, wherein the search system is implemented with ElasticSearch; a search unit, used for searching article contents and titles in real time in the search system when an article search request of a user terminal is received; and a display unit, used for aggregating the data matched during retrieval with articles as objects and highlighting the matched articles on the user terminal in a paging display mode.
Optionally, the storage unit is further configured to: split the whole article by title and paragraph and then store it in the distributed storage system using a standard data structure, wherein the standard data structure comprises the following fields: the identifier of the article to which the data belongs, the article type, the source of the article, the article name and content, the URL of the article, whether the data is the title of the article, the release time of the article, the data generation time, and the storage time.
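The field list above can be pictured as one record per title or paragraph. The following sketch is illustrative only: the class name `Segment`, the attribute names, and the rule of splitting the body on line breaks are assumptions, since the application names the fields but not their identifiers or the exact splitting rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Segment:
    """One stored record: either the title or a single paragraph of an article."""
    article_id: str         # identifier of the article the record belongs to
    article_type: str
    source: str
    text: str               # the article name (title) or the paragraph content
    url: str
    is_title: bool          # True when this record is the article's title
    publish_time: datetime  # release time of the article
    created_at: datetime    # data generation time
    stored_at: datetime     # storage time


def split_article(article_id, article_type, source, url, title, body,
                  publish_time) -> List[Segment]:
    """Split an article into a title record followed by one record per paragraph."""
    now = datetime.now(timezone.utc)

    def make(text, is_title):
        return Segment(article_id, article_type, source, text, url,
                       is_title, publish_time, now, now)

    paragraphs = [line.strip() for line in body.split("\n") if line.strip()]
    return [make(title, True)] + [make(p, False) for p in paragraphs]


segments = split_article(
    "a-1", "news", "example-source", "https://example.com/a-1",
    "Search basics", "First paragraph.\n\nSecond paragraph.",
    datetime(2021, 4, 16, tzinfo=timezone.utc),
)
```

Because every record carries the article identifier, matched titles and paragraphs can later be grouped back into their article.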
Optionally, the search unit is further configured to: perform, according to keywords in the article search request, real-time retrieval of article contents and titles in the search system to obtain a search result, wherein the search result comprises a single article or a plurality of articles. The single article covers the following cases: a single article matched only on its title, a single article matched only on its paragraphs, and a single article matched on both its title and its paragraphs. The plurality of articles covers the following cases: all of the matched articles are matched on their titles, some of the articles are matched on both title and paragraphs, some of the articles are matched only on their titles, and some of the articles are matched only on their paragraphs.
Optionally, the display unit is further configured to: for data matched on the title, aggregate the data that share the same article identifier; and for data matched only on paragraphs, aggregate the data that share the same article identifier.
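Aggregating by article identifier can be sketched as a grouping step over the matched records. This is a minimal illustration, assuming each hit is a dictionary with the hypothetical keys `article_id`, `is_title`, and `text`; the function name is also an assumption.

```python
def aggregate_by_article(hits):
    """Group matched records by article id, keeping title and paragraph hits apart."""
    articles = {}  # dicts keep insertion order, so first-seen article order is preserved
    for hit in hits:
        entry = articles.setdefault(
            hit["article_id"],
            {"article_id": hit["article_id"], "title_hits": [], "paragraph_hits": []},
        )
        bucket = "title_hits" if hit["is_title"] else "paragraph_hits"
        entry[bucket].append(hit["text"])
    return list(articles.values())


hits = [
    {"article_id": "a-1", "is_title": True, "text": "Search basics"},
    {"article_id": "a-2", "is_title": False, "text": "Sharding and search."},
    {"article_id": "a-1", "is_title": False, "text": "Search uses an index."},
]
grouped = aggregate_by_article(hits)
```

Each returned entry represents one article, which is the unit later shown to the user.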
Optionally, the display unit is further configured to: calculate the relevance between the matched articles and the keywords in the article search request using a relevance algorithm; and highlight the matched articles on the user terminal in descending order of relevance in a paging display mode.
Optionally, the display unit is further configured to: highlight the matched articles on the user terminal in descending order of relevance according to a preset display configuration, wherein the preset display configuration is either to display only the title or to display the title and the paragraphs together.
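The ordering, paging, and display configuration described above can be sketched together. The application does not specify the relevance algorithm, so the example assumes each aggregated article already carries a numeric `score`; the function names, the dictionary keys, and the `<em>` highlight tags are likewise assumptions.

```python
import re


def highlight(text, keyword, pre="<em>", post="</em>"):
    """Wrap every case-insensitive occurrence of keyword in highlight tags."""
    if not keyword:
        return text
    return re.sub(re.escape(keyword),
                  lambda m: f"{pre}{m.group(0)}{post}",
                  text, flags=re.IGNORECASE)


def render_page(articles, keyword, page_no, page_size, show_paragraphs=True):
    """Sort by score (descending), take one page, and highlight the matches."""
    ranked = sorted(articles, key=lambda a: a["score"], reverse=True)
    start = (page_no - 1) * page_size
    page = []
    for article in ranked[start:start + page_size]:
        item = {"article_id": article["article_id"],
                "title": highlight(article["title"], keyword)}
        if show_paragraphs:  # preset configuration: title plus paragraphs
            item["paragraphs"] = [highlight(p, keyword)
                                  for p in article["paragraphs"]]
        page.append(item)
    return page


articles = [
    {"article_id": "a-1", "score": 0.4, "title": "Intro", "paragraphs": ["Search tips."]},
    {"article_id": "a-2", "score": 0.9, "title": "ElasticSearch basics", "paragraphs": []},
]
first_page = render_page(articles, "search", page_no=1, page_size=1,
                         show_paragraphs=False)
```

Setting `show_paragraphs=False` corresponds to the title-only configuration; the default shows the title together with the matched paragraphs.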
Optionally, the apparatus of the present application may further comprise: a response unit, used for, after the matched articles have been highlighted in a paging display mode and when a further search request of the user terminal is received, returning to the user terminal the same search result as the previous one if the keyword in the received search request is the same as the keyword in the earlier article search request.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program which, when executed, performs the above-described method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above method through the computer program.
The relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Spatially relative terms, such as "above" and the like, may be used herein for ease of description to describe the spatial relationship of one device or feature to another device or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above" can include both an orientation of "above" and an orientation of "below". The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
In the description of the present invention, it is to be understood that orientation words such as "front, rear, upper, lower, left, right", "lateral, vertical, horizontal", and "top, bottom" usually indicate an orientation or positional relationship based on that shown in the drawings, and are used only for convenience and simplicity of description. Unless stated otherwise, these orientation words do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be considered as limiting the scope of the present invention. The terms "inner" and "outer" refer to the inside and outside relative to the profile of the respective component itself.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for searching based on article content and titles is characterized by comprising the following steps:
storing article data by using a search system in a distributed storage mode, wherein the search system is implemented with ElasticSearch;
when an article searching request of a user terminal is received, searching article contents and titles in real time in the searching system;
and aggregating the matched data during retrieval by taking the articles as objects, and highlighting the matched articles on the user terminal in a paging display mode.
2. The method of claim 1, wherein storing article data in a distributed storage manner using a search system comprises:
splitting the whole article by title and paragraph and then storing it in a distributed storage system using a standard data structure, wherein the standard data structure comprises the following fields: the identifier of the article to which the data belongs, the article type, the source of the article, the article name and content, the URL of the article, whether the data is the title of the article, the release time of the article, the data generation time, and the storage time.
3. The method of claim 1, wherein performing real-time retrieval of article content and titles in the search system upon receiving an article search request comprises:
according to keywords in the article search request, performing real-time retrieval of article contents and titles in the search system to obtain a search result, wherein the search result comprises a single article or a plurality of articles, and the single article covers the following cases: a single article matched only on its title, a single article matched only on its paragraphs, and a single article matched on both its title and its paragraphs, and the plurality of articles covers the following cases: all of the matched articles are matched on their titles, some of the articles are matched on both title and paragraphs, some of the articles are matched only on their titles, and some of the articles are matched only on their paragraphs.
4. The method of claim 1, wherein aggregating the data matched during retrieval with articles as objects comprises:
for data matched on the title, aggregating the data that share the same article identifier;
and for data matched only on paragraphs, aggregating the data that share the same article identifier.
5. The method of claim 1, wherein highlighting the matched article on the user terminal in a paginated presentation comprises:
calculating the relevance between the matched articles and keywords in the article search request by adopting a relevance algorithm;
and highlighting the matched articles on the user terminal in descending order of relevance in a paging display mode.
6. The method of claim 5, wherein highlighting the matched articles on the user terminal in descending order of relevance comprises:
highlighting the matched articles on the user terminal in descending order of relevance according to a preset display configuration, wherein the preset display configuration is either to display only the title or to display the title and the paragraphs together.
7. The method of claim 1, wherein after highlighting the matched articles in a paginated presentation, the method further comprises:
when a further search request of the user terminal is received, if the keyword in the received search request is the same as the keyword in the article search request, returning to the user terminal the same search result as the previous search result.
8. An article content and title based search apparatus, comprising:
a storage unit, used for storing article data by using a search system in a distributed storage mode, wherein the search system is implemented with ElasticSearch;
the search unit is used for searching article contents and titles in real time in the search system when receiving an article search request of a user terminal;
and the display unit is used for aggregating the matched data during retrieval by taking the articles as objects and highlighting the matched articles on the user terminal in a paging display mode.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any of the preceding claims 1 to 7 by means of the computer program.
CN202110412837.7A 2021-04-16 2021-04-16 Searching method and device based on article content and title Active CN113032436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110412837.7A CN113032436B (en) 2021-04-16 2021-04-16 Searching method and device based on article content and title

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110412837.7A CN113032436B (en) 2021-04-16 2021-04-16 Searching method and device based on article content and title

Publications (2)

Publication Number Publication Date
CN113032436A true CN113032436A (en) 2021-06-25
CN113032436B CN113032436B (en) 2022-05-31

Family

ID=76457366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110412837.7A Active CN113032436B (en) 2021-04-16 2021-04-16 Searching method and device based on article content and title

Country Status (1)

Country Link
CN (1) CN113032436B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170068712A1 (en) * 2015-09-04 2017-03-09 Palantir Technologies Inc. Systems and methods for database investigation tool
CN105447187A (en) * 2015-12-15 2016-03-30 广州神马移动信息科技有限公司 Webpage search method and system
CN105404699A (en) * 2015-12-29 2016-03-16 广州神马移动信息科技有限公司 Method, device and server for searching articles of finance and economics
CN106354759A (en) * 2016-08-18 2017-01-25 北京百迈客云科技有限公司 Retrieving and automatically downloading system of articles and data based on biological cloud platform
CN106776878A (en) * 2016-11-29 2017-05-31 西安交通大学 A kind of method for carrying out facet retrieval to MOOC courses based on ElasticSearch
CN107958080A (en) * 2017-12-14 2018-04-24 上海特易信息科技有限公司 A kind of big data report processing method based on ElasticSearch
CN108509524A (en) * 2018-03-12 2018-09-07 上海哔哩哔哩科技有限公司 Method, server and the system of data processing of data processing
US20190384793A1 (en) * 2018-06-15 2019-12-19 EMC IP Holding Company LLC Methods, apparatuses, and computer storage media for data searching
CN108932320A (en) * 2018-06-27 2018-12-04 广州优视网络科技有限公司 Article search method, apparatus and electronic equipment
CN109359142A (en) * 2018-09-29 2019-02-19 北京明朝万达科技股份有限公司 A kind of data processing method, data processing equipment, computer equipment and readable storage medium storing program for executing
CN109359173A (en) * 2018-10-24 2019-02-19 南京大学 A kind of search method of judgement document
CN109492148A (en) * 2018-11-22 2019-03-19 北京明朝万达科技股份有限公司 ElasticSearch paging query method and apparatus based on Redis
CN109933589A (en) * 2019-03-15 2019-06-25 北京计算机技术及应用研究所 Data structure conversion method based on ElasticSearch aggregation operation results for data summarization
CN110795458A (en) * 2019-10-08 2020-02-14 北京百分点信息科技有限公司 Interactive data analysis method, device, electronic equipment and computer readable storage medium
CN111143443A (en) * 2019-11-29 2020-05-12 数字广东网络建设有限公司 Method, device, system, terminal and storage medium for displaying government affair information
CN111522905A (en) * 2020-04-15 2020-08-11 武汉灯塔之光科技有限公司 Document searching method and device based on database
CN112148885A (en) * 2020-09-04 2020-12-29 上海晏鼠计算机技术股份有限公司 Intelligent searching method and system based on knowledge graph
CN112330466A (en) * 2020-10-30 2021-02-05 泰康保险集团股份有限公司 Online monitoring method and device for medical insurance fund illegal operation event
CN112463816A (en) * 2020-11-23 2021-03-09 上海好屋网信息技术有限公司 API-based query system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
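Su Dadong: "Application of search engines in building a unified retrieval system for library websites", Sci-Tech Information Development & Economy *
Tong Ming et al.: "Analysis of university wireless network logs based on Elasticsearch", Journal of Wuhan University of Technology *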

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116895337A (en) * 2023-09-07 2023-10-17 智菲科技集团有限公司 Synthetic biological element database system
CN116895337B (en) * 2023-09-07 2023-11-17 智菲科技集团有限公司 Synthetic biological element database system

Also Published As

Publication number Publication date
CN113032436B (en) 2022-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant