CN111339391A - Novel search method based on public data - Google Patents

Publication number
CN111339391A
Authority
CN
China
Prior art keywords
entity
data
multimedia
retrieval
database
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010169574.7A
Other languages
Chinese (zh)
Other versions
CN111339391B (en)
Inventor
李林亮
张盛泽
王娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Andlinks Data Technology Co ltd
Original Assignee
Nanjing Andlinks Data Technology Co ltd
Application filed by Nanjing Andlinks Data Technology Co ltd
Priority to CN202010169574.7A
Publication of CN111339391A
Application granted
Publication of CN111339391B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a novel search method based on public data, comprising the following steps: according to pre-arranged entry information, crawling the Internet for the content corresponding to the entry information and the multimedia data associated with the entity, and analyzing them to obtain an attribute list of the entry information and a multimedia list associated with the entity; performing data modeling with a jena tool to merge the attributes of entries of the same category, and analyzing the content of the merged attributes to obtain the relationships between the entries; and initializing the retrieval service data, performing semantic analysis on query sentences with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation results, and returning the retrieval results. The invention supports not only traditional keyword content retrieval but also picture retrieval, semantic retrieval and condition retrieval, and can display entities and their relationship graph in detail through a knowledge graph, which makes the data retrieval function more powerful.

Description

Novel search method based on public data
Technical Field
The invention relates to the technical field of electronic information, in particular to a novel searching method based on public data.
Background
With the development of computer and information technology, paper encyclopedias have gradually faded from people's view, and online encyclopedias have gradually become familiar to the public. Baidu released the official version of Baidu Baike (Baidu Encyclopedia) on April 21, 2008; to date it has recorded 16 million entries, about 7 million people have taken part in editing them, and it covers almost all known fields of knowledge. Baidu Baike aims to build a Chinese-language information collection platform covering knowledge in every field, and because the encyclopedia is open to the public, its content generally reflects objective facts.
Baidu Baike is both an information platform and a search platform: it can record information and retrieve the recorded information. However, its search relies only on simple word segmentation, so its search function is not powerful enough.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned.
Therefore, the invention aims to provide a novel searching method based on public data.
In order to achieve the above object, an embodiment of the present invention provides a novel search method based on public data, including the following steps:
step S1, according to the pre-arranged entry information, crawling the content corresponding to the entry information and the multimedia data associated with the entity on the Internet, analyzing them to obtain an attribute list of the entry information and a multimedia list associated with the entity, and storing the attribute list and the multimedia list in a MYSQL database;
step S2, performing data modeling on the entry data stored in the MYSQL database by adopting a jena tool to merge the attributes of entries of the same category, and analyzing the contents of the merged attributes to obtain the relationship between the entries;
step S3, initializing the search service data, including: initializing data in a MYSQL database, an ES database and a graph database;
and step S4, performing semantic analysis on the query sentence by adopting an NLP tool, performing word segmentation recognition on the fuzzy sentence, performing keyword retrieval on a WEB search service page according to word segmentation results, and returning a retrieval result.
Further, in step S1, the abstract and attribute fields of the crawled content corresponding to the entry information and of the multimedia data associated with the entity are analyzed to obtain an attribute list corresponding to the entry.
Further, in step S2, a jena tool is used to perform data modeling with each entry treated as an entity; the relationships of the model are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and hence whether a corresponding relationship exists between the two entities, so as to obtain the relationship types and the relationship list between the entries.
Further, in step S3, initializing the data of the MYSQL database includes: initializing the mapping relationship between entities and multimedia, which is used to reversely query the entity from the multimedia.
Further, in step S3, initializing the data of the ES database includes:
storing the entity content in the MYSQL database, the entry relationship list arranged in jena, and the multimedia information into the ES database for keyword retrieval.
Further, in step S3, the feature value of each picture and the cover information of each video are added to the multimedia information, and the association relationship between the multimedia and the entity is stored in MYSQL.
Further, in step S3, initializing the graph database includes: initializing the data of the entities and relationships in the model, and using the graph database to render the graph on the WEB page.
Further, in step S4, performing keyword retrieval includes: searching for the entities, relationships, graph relationships and multimedia corresponding to the keyword; analyzing the query sentence with an NLP tool to perform word segmentation, and querying according to the returned word segmentation result; and performing condition queries according to entity attributes and searches by picture feature value.
The novel search method based on public data according to the embodiment of the invention performs analysis and modeling on published data, analyzes query sentences with an NLP tool, computes the feature value of each picture with a feature value algorithm, and associates picture data with entity relationships, thereby strengthening the search function: keywords can be searched within the data range, searches can be semantically analyzed, condition searches can be performed on entity attribute fields, and entities can be searched by picture. In addition, the invention uses knowledge graph technology to display the corresponding graph relationships on the WEB page, which makes those relationships vivid, and the data management platform provides management functions for the model and the data, which makes maintenance more convenient.
Starting from published data content (taking Baidu Baike data as an example), the invention analyzes the content of the public data and combines it with a purpose-built search platform to provide keyword retrieval, semantic search, condition query and picture search over the data. With the help of technologies such as machine learning, artificial intelligence and graph databases, it can analyze data relationships, retrieve data by semantic conditions, and match picture feature values to find related pictures. Through a more detailed modeling approach, the invention supports traditional keyword content retrieval, picture retrieval, semantic retrieval and condition retrieval, and can display entities and their relationship graph in detail through a knowledge graph, which makes the data retrieval function more powerful.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a novel search method based on public data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a novel search method based on public data according to an embodiment of the present invention;
FIG. 3 is a flow chart of jena model creation according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary and intended to explain the invention; they are not to be construed as limiting it.
As shown in fig. 1 and fig. 2, the novel searching method based on public data according to the embodiment of the present invention includes the following steps:
Step S1, according to the pre-arranged entry information, crawling the content corresponding to the entry information and the multimedia data associated with the entity on the Internet, analyzing them to obtain an attribute list of the entry information and a multimedia list associated with the entity, and storing the attribute list and the multimedia list in the MYSQL database.
Specifically, the required entry information for each field is obtained first, for example by sorting and collecting the required entries from the encyclopedia. For example, when military-related entries need to be sorted, the entry categories are determined first, such as key domestic figures, fighters and tanks, and then the list of entries under each category is compiled. The entry information may include the entry category, the specific entry, and so on.
Then, the abstract and attribute fields of the crawled content corresponding to the entry information and of the multimedia data associated with the entity are analyzed to obtain the attribute list corresponding to the entry.
Specifically, crawler software crawls the corresponding entry data on the Internet (for example, in the encyclopedia) and downloads all multimedia data related to each entry. For the crawled website data, the abstract and attribute fields are analyzed by an algorithm to obtain all attribute fields of the entry; a relatively complete attribute list is compiled and stored in the MYSQL database. In addition, the invention compiles the multimedia list associated with each entity and persists the summarized data in the MYSQL database.
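The attribute analysis of step S1 can be illustrated with a minimal Python sketch. It is not the patent's algorithm, which is not disclosed in detail. The sketch assumes that a crawled page exposes its attribute fields as `<dt>`/`<dd>` pairs (a hypothetical markup), and it uses an in-memory SQLite database as a stand-in for the MYSQL database; the table name `entry_attribute` and the sample page are likewise assumptions.

```python
import sqlite3
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collects (attribute, value) pairs from <dt>/<dd> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (name, value) pairs
        self._tag = None    # "dt" or "dd" while inside one of them
        self._name = ""     # text of the most recent <dt>
        self._buf = []      # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "dt":
            self._name = text
        elif self._name:
            self.pairs.append((self._name, text))
            self._name = ""
        self._tag = None


def parse_attributes(html):
    """Return the attribute list of one crawled entry page."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def store_attributes(conn, entry, pairs):
    """Persist the attribute list (SQLite here, standing in for MYSQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry_attribute (entry TEXT, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry_attribute VALUES (?, ?, ?)",
        [(entry, name, value) for name, value in pairs],
    )
    conn.commit()


# Hypothetical crawled fragment for the entry "fighter-10".
SAMPLE_HTML = "<dl><dt>origin</dt><dd>China</dd><dt>type</dt><dd>fighter</dd></dl>"
conn = sqlite3.connect(":memory:")
store_attributes(conn, "fighter-10", parse_attributes(SAMPLE_HTML))
```

A real crawler would need site-specific parsing rules; the point of the sketch is only the shape of the result, one row per entry attribute, which the later modeling step consumes.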
Step S2, performing data modeling on the entry data stored in the MYSQL database with a jena tool to merge the attributes of entries of the same category, and analyzing the content of the merged attributes to obtain the relationships between the entries. That is, the entry data in MYSQL are analyzed: first, the attributes of entries of the same category are merged, and then the relationships between entries are worked out from the entries' attribute lists.
Specifically, as shown in fig. 3, the jena tool is used for data modeling. The model created by jena mainly consists of entities and relationships. Each entry is a specific entity, and the entry category compiled at the outset is its entity type. The relationships of the model are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and hence whether a corresponding relationship exists between the two entities. For example, if the origin attribute of the fighter-10 is another entry, China, then China and the fighter-10 have an origin relationship. Applying this rule yields the relationship types and the relationship list between the entries. Finally, the compiled entities and relationships are modeled with jena, and a corresponding owl model file is generated. The model is used extensively later, when the data are mapped into the graph database.
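The judgment rule above can be sketched in a few lines of Python. This is a minimal sketch, not the patent's jena-based implementation: it uses plain dictionaries instead of an owl model, and the record layout, the attribute names and the hypothetical entry "fighter-X" are assumptions made for illustration.

```python
def merge_category_attributes(entities):
    """Union of attribute fields over all entries of the same category."""
    merged = {}
    for info in entities.values():
        merged.setdefault(info["type"], set()).update(info["attributes"])
    return merged


def extract_relations(entities):
    """An attribute whose value is itself another entry becomes a relationship.

    Returns (entity, relationship type, other entity) triples.
    """
    relations = []
    for name, info in entities.items():
        for field, value in info["attributes"].items():
            if value in entities and value != name:
                relations.append((name, field, value))
    return relations


# Entry name -> entity type and attribute list (illustrative data only).
ENTITIES = {
    "fighter-10": {"type": "fighter", "attributes": {"origin": "China", "crew": "1"}},
    "fighter-X": {"type": "fighter", "attributes": {"origin": "China", "max speed": "Mach 2"}},
    "China": {"type": "country", "attributes": {"capital": "Beijing"}},
}

MERGED = merge_category_attributes(ENTITIES)
RELATIONS = extract_relations(ENTITIES)
RELATION_TYPES = sorted({field for _, field, _ in RELATIONS})
```

Here "origin" becomes a relationship type because its value, China, is itself an entry, whereas "crew" and "capital" stay ordinary attributes because their values are not entries.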
Step S3, initializing the search service data, including: initializing data in the MYSQL database, the ES database and the graph database.
In this step, the data in MYSQL, ES and the graph database Neo4j are initialized.
1. Initializing the data of the MYSQL database includes: initializing the mapping relationship between entities and multimedia, which is used to reversely query the entity from the multimedia.
2. Initializing the data of the ES database includes:
The ES search middleware stores the specific entity and relationship content: the entity content in the MYSQL database and the entry relationship list arranged in jena are stored into the ES database for keyword retrieval.
In addition, the multimedia information is stored in the ES database for keyword retrieval. Before storage, the invention processes the multimedia information by adding the feature value of each picture and the BASE64 cover information of each video. The picture feature value is obtained by processing the picture's pixels with an algorithm, which lays the foundation for later picture search; the video cover, obtained by extracting a frame of the video that contains an image, is used for display. The processed multimedia information is stored in ES, and the association relationship between the multimedia and the entity is stored in the MYSQL database.
3. Initializing the graph database includes: initializing the data of the entities and relationships in the model, and using the Neo4j graph database to render the graph on the WEB page. The structure of the graph database itself holds the content of the graph relationships, so retrieval is fast, which is exactly what graph display requires.
Specifically, the specific entity and relationship content is stored in the graph database: based on the model arranged in jena, the complete entity and relationship lists are written into the graph database, which makes later rendering of the graph convenient.
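The three initializations of step S3 can be pictured with the following Python sketch. It deliberately uses in-process stand-ins rather than the real services: SQLite for the MYSQL mapping table, a dictionary-based inverted index for the ES database, and an adjacency dictionary for the Neo4j graph database. All table names, identifiers and sample records are assumptions for illustration.

```python
import sqlite3
from collections import defaultdict


def init_media_mapping(conn, media_links):
    """Relational store: (media_id, entity) rows enable the reverse lookup."""
    conn.execute("CREATE TABLE media_entity (media_id TEXT, entity TEXT)")
    conn.executemany("INSERT INTO media_entity VALUES (?, ?)", media_links)
    conn.commit()


def entities_for_media(conn, media_id):
    """Reverse query: which entities does this picture or video belong to?"""
    rows = conn.execute(
        "SELECT entity FROM media_entity WHERE media_id = ? ORDER BY entity",
        (media_id,),
    )
    return [row[0] for row in rows]


def init_keyword_index(documents):
    """Search store: token -> set of document ids (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def init_graph(relations):
    """Graph store: entity -> list of (relationship, neighbour) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in relations:
        graph[subject].append((relation, obj))
        graph.setdefault(obj, [])  # make sure the target exists as a node
    return graph


conn = sqlite3.connect(":memory:")
init_media_mapping(conn, [("img-001", "fighter-10"), ("vid-001", "fighter-10")])
index = init_keyword_index(
    {"fighter-10": "fighter-10 origin China", "China": "China country"}
)
graph = init_graph([("fighter-10", "origin", "China")])
```

Keeping the media-to-entity mapping in the relational store while the searchable text lives in the index mirrors the division of labour described above: the index answers keyword queries, the mapping answers the reverse lookup, and the graph feeds the rendering.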
Step S4, performing semantic analysis on the query sentence with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on the WEB search service page according to the word segmentation results, and returning the retrieval results.
Specifically, when the NLP service is initialized, the entity types, relationship types and specific entity information are loaded into its corpus and word segmentation lexicon. The method supports NLP natural language semantic analysis: it performs word segmentation on fuzzy sentences using inference, probability and statistics, and finally produces information that is convenient for retrieval and query. For example, for the query "mortar with a shooting speed higher than 200km/h", natural language analysis yields the entity type mortar, the attribute shooting speed, the attribute value 200km/h and the condition "higher than"; the method then searches among mortars for the entities whose shooting speed attribute is higher than 200km/h.
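The mortar example can be made concrete with a small Python sketch. Note the substitution: the text describes probabilistic and statistical word segmentation, whereas this sketch uses a single hand-written regular expression, which is only enough to show what the parsed query looks like and how it drives a condition search. The pattern, the comparator table and the entity records ("mortar-A" and so on) are all hypothetical.

```python
import operator
import re

# Condition phrases mapped to comparison functions (assumed vocabulary).
COMPARATORS = {"higher than": operator.gt, "lower than": operator.lt}

QUERY_PATTERN = re.compile(
    r"^(?P<type>\w+) with (?:an? |the )?(?P<attr>[\w ]+?) "
    r"(?P<op>higher than|lower than) "
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S*)$"
)


def parse_query(sentence):
    """Split a condition query into entity type, attribute, condition and value."""
    match = QUERY_PATTERN.match(sentence.strip().lower())
    if match is None:
        return None
    return {
        "entity_type": match["type"],
        "attribute": match["attr"],
        "condition": match["op"],
        "value": float(match["value"]),
        "unit": match["unit"],
    }


def search(entities, query):
    """Return the entities of the requested type that satisfy the condition."""
    compare = COMPARATORS[query["condition"]]
    return [
        name
        for name, info in entities.items()
        if info["type"] == query["entity_type"]
        and query["attribute"] in info["attributes"]
        and compare(info["attributes"][query["attribute"]], query["value"])
    ]


# Illustrative records; attribute values are plain numbers in km/h.
ENTITIES = {
    "mortar-A": {"type": "mortar", "attributes": {"shooting speed": 250}},
    "mortar-B": {"type": "mortar", "attributes": {"shooting speed": 150}},
    "tank-C": {"type": "tank", "attributes": {"shooting speed": 300}},
}

QUERY = parse_query("mortar with a shooting speed higher than 200km/h")
RESULT = search(ENTITIES, QUERY)
```

The parsed query carries exactly the four pieces named in the text (entity type, attribute, attribute value, condition), and the search keeps "mortar-A" while rejecting the slower mortar and the entity of a different type.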
On the WEB search service page, the invention can perform keyword retrieval on the analyzed and processed public data, where the scope of the keywords includes entities, relationships, graph relationships and multimedia; it can perform semantic retrieval, in which the query sentence is analyzed by the NLP tool and the query is run on the returned word segmentation result; and it can perform condition queries according to entity attributes as well as searches by picture feature value. Finally, the invention also supports managing the model structure and updating the data content on the data management platform.
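Searching by picture feature value can be sketched as follows. The patent does not name its feature value algorithm, so this sketch assumes a simple average hash compared by Hamming distance; the tiny 2x2 grayscale grids, the picture ids and the distance threshold are illustrative assumptions only.

```python
def average_hash(pixels):
    """Feature value of a picture given as a 2D list of grayscale values (0-255).

    Each pixel contributes one bit: 1 if it is brighter than the mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two feature values differ."""
    return bin(a ^ b).count("1")


def search_by_picture(query_pixels, indexed, max_distance=2):
    """indexed: {picture_id: feature value}. Returns matching ids, closest first."""
    query_hash = average_hash(query_pixels)
    hits = [
        (hamming_distance(query_hash, feature), picture_id)
        for picture_id, feature in indexed.items()
    ]
    return [picture_id for distance, picture_id in sorted(hits) if distance <= max_distance]


# Toy pictures: two are bright on the left, one is bright on the right.
BRIGHT_LEFT = [[200, 10], [200, 10]]
BRIGHT_LEFT_VARIANT = [[190, 20], [210, 160]]
BRIGHT_RIGHT = [[10, 200], [10, 200]]

INDEXED = {
    "img-a": average_hash(BRIGHT_LEFT_VARIANT),
    "img-b": average_hash(BRIGHT_RIGHT),
}
MATCHES = search_by_picture(BRIGHT_LEFT, INDEXED)
```

In practice a picture would first be scaled down to a small fixed-size grayscale grid before hashing; the matched picture ids could then be resolved to entities through the multimedia-to-entity mapping kept in the MYSQL database.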
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A novel search method based on public data is characterized by comprising the following steps:
step S1, according to the pre-arranged entry information, crawling the content corresponding to the entry information and the multimedia data associated with the entity on the Internet, analyzing them to obtain an attribute list of the entry information and a multimedia list associated with the entity, and storing the attribute list and the multimedia list in a MYSQL database;
step S2, performing data modeling on the entry data stored in the MYSQL database by adopting a jena tool to merge the attributes of entries of the same category, and analyzing the contents of the merged attributes to obtain the relationship between the entries;
step S3, initializing the search service data, including: initializing data in a MYSQL database, an ES database and a graph database;
and step S4, performing semantic analysis on the query sentence by adopting an NLP tool, performing word segmentation recognition on the fuzzy sentence, performing keyword retrieval on a WEB search service page according to word segmentation results, and returning a retrieval result.
2. The novel public data-based search method according to claim 1, wherein in step S1, the abstract and attribute fields of the crawled content corresponding to the entry information and of the multimedia data associated with the entity are analyzed to obtain the attribute list corresponding to the entry.
3. The novel public data-based search method according to claim 1, wherein in step S2, a jena tool is used to perform data modeling with each entry treated as an entity, the relationships of the model are extracted from the attribute fields of the entities, and, after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity and hence whether a corresponding relationship exists between the two entities, so as to obtain the relationship types and the relationship list between the entries.
4. The novel public data-based search method according to claim 1, wherein in step S3, initializing the data of the MYSQL database comprises: initializing the mapping relationship between entities and multimedia, which is used to reversely query the entity from the multimedia.
5. The novel public data-based search method according to claim 1, wherein in step S3, initializing the data of the ES database comprises:
storing the entity content in the MYSQL database, the entry relationship list arranged in jena, and the multimedia information into the ES database for keyword retrieval.
6. The novel public data-based searching method of claim 5, wherein in the step S3, the feature value of the picture and the cover information of the video are added to the multimedia information, and the association relationship between the multimedia and the entity is stored in MYSQL.
7. The novel public data-based searching method according to claim 1, wherein in the step S3, initializing a graph database includes: initializing data of entities and relations in the model, and rendering the graph on the WEB page by using the graph database.
8. The novel public data-based search method according to claim 1, wherein in step S4, performing keyword retrieval comprises: searching for the entities, relationships, graph relationships and multimedia corresponding to the keyword; analyzing the query sentence with an NLP tool to perform word segmentation, and querying according to the returned word segmentation result; and performing condition queries according to entity attributes and searches by picture feature value.
CN202010169574.7A 2020-03-12 2020-03-12 Novel searching method based on public data Active CN111339391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010169574.7A CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010169574.7A CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Publications (2)

Publication Number Publication Date
CN111339391A (en) 2020-06-26
CN111339391B (en) 2024-03-19

Family

ID=71186022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010169574.7A Active CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Country Status (1)

Country Link
CN (1) CN111339391B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413784A (en) * 2019-07-23 2019-11-05 国家计算机网络与信息安全管理中心 The public sentiment association analysis method and system of knowledge based map
CN110532404A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 One provenance multimedia determines method, apparatus, equipment and storage medium

Also Published As

Publication number Publication date
CN111339391B (en) 2024-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant