CN111339391B - Novel searching method based on public data - Google Patents

Novel searching method based on public data

Info

Publication number
CN111339391B
Authority
CN
China
Prior art keywords
entity
data
multimedia
relation
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010169574.7A
Other languages
Chinese (zh)
Other versions
CN111339391A (en)
Inventor
李林亮
张盛泽
王娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Andlinks Data Technology Co ltd
Original Assignee
Nanjing Andlinks Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Andlinks Data Technology Co ltd
Priority to CN202010169574.7A
Publication of CN111339391A
Application granted
Publication of CN111339391B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a novel search method based on public data, comprising the following steps: according to entry information compiled in advance, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, and parsing them to obtain an attribute list of the entry information and a multimedia list associated with the entities; performing data modeling with the Jena tool to merge the attributes of entries of the same category, and deriving the relations among entries from the content of the merged attributes; initializing the search service data; and performing semantic analysis on the query sentence with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, and returning the retrieval result. The invention supports not only traditional keyword content retrieval but also picture retrieval, semantic retrieval and conditional retrieval, and can display entities and their relation graph in detail, making the data retrieval function more powerful.

Description

Novel searching method based on public data
Technical Field
The invention relates to the technical field of electronic information, in particular to a novel searching method based on public data.
Background
With the development of computer and information technology, paper encyclopedias have gradually faded from public view, while online encyclopedias have steadily become familiar to the public. Baidu released the official version of Baidu Encyclopedia (Baidu Baike) on April 21, 2008; to date it has recorded more than 16 million entries, more than 7 million people have participated in editing them, and it covers almost all known fields of knowledge. Baidu Encyclopedia aims to build a Chinese-language information collection platform covering knowledge in every field, and because it is open to the public, its content generally reflects objective facts.
Baidu Encyclopedia is both an information platform and a search platform: it records information, and the recorded information can be searched. However, its search platform retrieves results only through simple word segmentation, so its search function is not powerful enough.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks.
To this end, the object of the invention is to propose a novel search method based on public data.
In order to achieve the above object, an embodiment of the present invention provides a novel search method based on public data, including the steps of:
step S1: according to entry information compiled in advance, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, parsing them to obtain an attribute list of the entry information and a multimedia list associated with the entities, and storing the attribute list and the multimedia list in a MySQL database;
step S2: performing data modeling on the entry data stored in the MySQL database with the Jena tool, so as to merge the attributes of entries of the same category, and deriving the relations among entries from the content of the merged attributes;
step S3: initializing the search service data, including initializing the data in the MySQL database, the ES database and the graph database;
step S4: performing semantic analysis on the query sentence with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, and returning the retrieval result.
Further, in step S1, the abstract and the attribute fields are parsed from the obtained content corresponding to the entry information and from the multimedia data associated with the entity, so as to obtain the attribute list corresponding to the entry.
Further, in step S2, data modeling is performed with the Jena tool, with each entry treated as an entity. Model relations are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and hence whether a relation exists between the two entities, so as to obtain the relation types and the relation list among the entries.
Further, in step S3, initializing the MySQL database includes initializing the mapping between entities and multimedia, which is used to look up an entity from a multimedia item (reverse query).
Further, in step S3, initializing the ES database includes:
storing the entity content from the MySQL database, the entry relation list compiled with Jena, and the multimedia information in the ES database for keyword retrieval.
Further, in step S3, the feature values of pictures and the cover information of videos are added to the multimedia information, and the association between multimedia and entities is stored in MySQL.
Further, in step S3, initializing the graph database includes initializing the entity and relation data of the model, and using the graph database to render the graph on a WEB page.
Further, in step S4, the keyword retrieval includes: retrieving the entities, relations, graph relations and multimedia corresponding to the keywords; analyzing and segmenting the query sentence with the NLP tool and querying according to the returned word segmentation result; and performing conditional queries and search-by-picture according to entity attributes and picture feature values.
The novel search method based on public data according to the embodiment of the invention analyzes and models public data, parses query sentences with an NLP tool, computes picture feature values with a feature-value algorithm, and associates picture data with entities. This strengthens the search function: data can be searched by keyword within the data scope, by semantic analysis, by condition on entity attribute fields, and finally by picture to find the corresponding entity. In addition, graph technology displays the graph relations on a WEB page in a vivid and intuitive way, and the data management platform provides management of the model and the data, which makes maintenance more convenient.
Taking Baidu Encyclopedia data as an example, the invention analyzes the content of public data and, combined with a purpose-built search platform, provides keyword retrieval, semantic search, conditional query and search-by-picture over the data. With the help of machine learning, artificial intelligence, graph database and other technologies, it can analyze data relations, retrieve data by semantic conditions, and match picture feature values to find related pictures. Under this more detailed modeling approach, the invention supports not only traditional keyword content retrieval but also picture retrieval, semantic retrieval and conditional retrieval, and can display entities and their relation graph in detail, making the data retrieval function more powerful.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a novel search method based on public data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a novel search method based on public data according to an embodiment of the present invention;
FIG. 3 is a flow chart of the jena model creation according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
As shown in FIG. 1 and FIG. 2, the novel search method based on public data according to the embodiment of the invention includes the following steps:
step S1: according to entry information compiled in advance, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, parsing them to obtain an attribute list of the entry information and a multimedia list associated with the entities, and storing the attribute list and the multimedia list in a MySQL database.
Specifically, the required entry information for each field is first acquired, for example by collecting the required entries from Baidu Encyclopedia. For instance, to compile military-related entries, the entry categories are determined first (such as key domestic figures, fighter aircraft and tanks), and then the list of entries under each category is compiled. The entry information may include the entry category, the specific entry, and so on.
Then, the abstract and the attribute fields are parsed from the obtained content corresponding to the entry information and from the multimedia data associated with the entity, so as to obtain the attribute list corresponding to the entry.
Specifically, crawler software crawls the corresponding entry data on the Internet (for example, the data of the corresponding entries in Baidu Encyclopedia) and downloads all multimedia data associated with those entries. For the crawled page data, the abstract and the attribute fields are parsed by an algorithm to obtain all attribute fields of each entry, and a relatively complete attribute list is compiled and stored in the MySQL database. In addition, the multimedia list associated with each entity is compiled, and all of the compiled data are persisted in the MySQL database.
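The parsing and storage part of step S1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the patent: the crawled page is taken to expose attributes as `<dt>name</dt><dd>value</dd>` pairs, the table name `entry_attribute` and the helper names are hypothetical, and an in-memory SQLite database stands in for the MySQL database so that the example runs without a server.

```python
import sqlite3
from html.parser import HTMLParser


class AttributeParser(HTMLParser):
    """Collect (name, value) pairs from <dt>name</dt><dd>value</dd> markup.

    The dt/dd layout is an assumed page structure, not one given in the patent.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (attribute name, attribute value) pairs
        self._tag = None   # tag currently being read: "dt", "dd" or None
        self._name = ""    # text of the most recent <dt>
        self._value = ""   # text of the current <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._tag, self._name = "dt", ""
        elif tag == "dd":
            self._tag, self._value = "dd", ""

    def handle_data(self, data):
        if self._tag == "dt":
            self._name += data
        elif self._tag == "dd":
            self._value += data

    def handle_endtag(self, tag):
        if tag == "dd" and self._tag == "dd":
            self.pairs.append((self._name.strip(), self._value.strip()))
        if tag in ("dt", "dd"):
            self._tag = None


def parse_attributes(html):
    """Return the attribute list of one crawled entry page."""
    parser = AttributeParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def store_attributes(conn, category, entry, pairs):
    """Persist the attribute list (SQLite here, standing in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry_attribute ("
        "category TEXT, entry TEXT, attr_name TEXT, attr_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry_attribute VALUES (?, ?, ?, ?)",
        [(category, entry, name, value) for name, value in pairs],
    )
    conn.commit()


# Illustrative page fragment; the content is made up for the example.
SAMPLE_HTML = """
<dl>
  <dt>place of origin</dt><dd>China</dd>
  <dt>type</dt><dd> fighter aircraft </dd>
</dl>
"""

conn = sqlite3.connect(":memory:")
pairs = parse_attributes(SAMPLE_HTML)
store_attributes(conn, "fighter", "Fighter-10", pairs)
rows = conn.execute(
    "SELECT attr_name, attr_value FROM entry_attribute "
    "WHERE entry = ? ORDER BY rowid",
    ("Fighter-10",),
).fetchall()
print(rows)
```

Keeping one row per (entry, attribute) pair, rather than one column per attribute, lets entries of different categories carry different attribute sets, which suits the later merging of attributes by category in step S2.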
step S2: performing data modeling on the entry data stored in the MySQL database with the Jena tool, so as to merge the attributes of entries of the same category, and deriving the relations among entries from the content of the merged attributes. That is, the entry data in MySQL are analyzed: entries of the same category are first merged, and the relations among entries are then compiled from the entries' attribute lists.
Specifically, as shown in FIG. 3, data modeling is performed with the Jena tool. The model created with Jena consists mainly of entities and relations. Entities comprise entity types and specific entities: each entry is a specific entity, and the entry category compiled at the outset is its entity type. Model relations are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and hence whether a relation exists between the two entities. For example, if the place-of-origin attribute of Fighter-10 has the value China, and China is itself another entry, then China and Fighter-10 have a place-of-origin relation. The relation types and the relation list are compiled by this rule. Finally, the compiled entities and relations are modeled with Jena and a corresponding OWL model file is generated; the model is subsequently used in the graph database.
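The relation-extraction rule described above (an attribute value that is itself another entry yields a relation named after the attribute) can be sketched in a few lines of Python. This is only an illustration of the rule: the data layout, the function name `extract_relations` and the sample values are assumptions, and the Jena modeling and OWL file generation are not reproduced here.

```python
def extract_relations(entities):
    """Derive relation types and a relation list from entity attributes.

    entities: {entity_name: {"category": str, "attributes": {attr: value}}}
    A triple (subject, attribute, object) is emitted whenever an attribute
    value is the name of another known entity.
    """
    relations = []
    for subject, info in entities.items():
        for attr, value in info["attributes"].items():
            if value in entities and value != subject:
                relations.append((subject, attr, value))
    relation_types = sorted({attr for _, attr, _ in relations})
    return relation_types, relations


# Illustrative data only; attribute names and values are made up.
ENTITIES = {
    "Fighter-10": {
        "category": "fighter",
        "attributes": {"place of origin": "China", "crew": "1"},
    },
    "China": {
        "category": "country",
        "attributes": {"capital": "Beijing"},
    },
}

relation_types, relations = extract_relations(ENTITIES)
print(relation_types)
print(relations)
```

Here "Beijing" and "1" are not entries in the data set, so only the place-of-origin attribute produces a relation. An exact string match is the simplest possible test for "is this value another entity"; a real data set would likely need normalization of names before matching.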
step S3: initializing the search service data, including initializing the data in the MySQL database, the ES database and the graph database.
In this step, the data in three stores are initialized: MySQL, ES, and the graph database Neo4j.
1. Initializing the MySQL database includes initializing the mapping between entities and multimedia, which is used to look up an entity from a multimedia item (reverse query).
2. Initializing the data of the ES database includes:
The ES search middleware stores the specific entity and relation content: the entity content from the MySQL database and the entry relation list compiled with Jena are stored in the ES database for keyword retrieval.
In addition, the multimedia information is stored in the ES database for keyword retrieval. The multimedia information in the database is first processed to add the feature value of each picture and the BASE64-encoded cover of each video. Picture feature values are computed by an algorithm and lay the foundation for the later search-by-picture function; the video cover is a frame extracted from the video that contains an image, and is used for display. The processed multimedia information is stored in ES, and the association between multimedia and entities is stored in the MySQL database.
3. Initializing the graph database includes initializing the entity and relation data of the model and using the Neo4j graph database to render the graph on the WEB page. The structure of a graph database matches the content of graph relations and supports fast lookup, which suits the needs of the graph well.
Specifically, the specific entity and relation content is stored in the graph database: from the model compiled with Jena, the complete entity and relation lists are written to the graph database, which facilitates subsequent rendering of the graph.
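As a rough illustration of writing the entity and relation lists to the graph database, the Python sketch below builds parameterized Cypher statements (Cypher being the query language of Neo4j). The node label `Entity`, the property names and the helper functions are assumptions for the example; the patent does not give a schema. Executing the statements would require a Neo4j driver and a running server, which are deliberately left out.

```python
import re


def relation_label(relation_type):
    """Turn a free-text relation type into an upper-case identifier.

    Relationship types cannot be passed as Cypher parameters, so the text
    is normalized here before being placed into the statement.
    """
    label = re.sub(r"\W+", "_", relation_type.strip()).strip("_").upper()
    if not label or label[0].isdigit():
        label = "REL_" + label
    return label


def build_cypher(entities, relations):
    """Return (statement, parameters) pairs for entities and relations.

    entities:  list of (name, category)
    relations: list of (subject, relation_type, object)
    """
    statements = []
    for name, category in entities:
        statements.append((
            "MERGE (e:Entity {name: $name}) SET e.category = $category",
            {"name": name, "category": category},
        ))
    for subject, relation_type, obj in relations:
        statements.append((
            "MATCH (a:Entity {name: $subject}), (b:Entity {name: $object}) "
            f"MERGE (a)-[:{relation_label(relation_type)}]->(b)",
            {"subject": subject, "object": obj},
        ))
    return statements


# Illustrative data only.
ENTITY_LIST = [("Fighter-10", "fighter"), ("China", "country")]
RELATION_LIST = [("Fighter-10", "place of origin", "China")]

statements = build_cypher(ENTITY_LIST, RELATION_LIST)
for statement, params in statements:
    print(statement, params)
```

Using `MERGE` rather than `CREATE` keeps the initialization idempotent, so the same entity and relation lists can be loaded again after the model is updated without duplicating nodes or edges.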
step S4: performing semantic analysis on the query sentence with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, and returning the retrieval result.
Specifically, when the NLP service is initialized, the contents of the corpus and the word segmentation dictionary are loaded, together with the entity categories, relation categories and specific entity information. The service supports natural-language semantic analysis: fuzzy sentences are segmented and recognized using inference, probability and statistical methods, and the information needed for search and query is finally extracted. For example, for the query "mortars with a firing rate higher than 200 km/h", natural-language analysis yields the entity mortar, the attribute firing rate, the attribute value 200 km/h and the condition "higher than"; the system then searches among mortar entities for those whose firing-rate attribute is higher than 200 km/h.
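The conditional query in the example above can be made concrete with a toy Python sketch. A single regular expression stands in for the NLP tool here, which is a large simplification and an assumption of this example: the patent describes corpus-based segmentation with probabilistic methods, not pattern matching. The entity records, their names and their values are made up for illustration.

```python
import re

# Comparison phrases the toy parser understands (an assumed, tiny vocabulary).
OPERATORS = {
    "higher than": lambda a, b: a > b,
    "lower than": lambda a, b: a < b,
}

PATTERN = re.compile(
    r"^(?P<entity>.+?) with (?:a |an |the )?(?P<attr>.+?) "
    r"(?P<op>higher than|lower than) "
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S*)$"
)


def parse_query(query):
    """Split a query into entity type, attribute, condition and value."""
    match = PATTERN.match(query.strip().lower())
    if match is None:
        return None
    return {
        "entity_type": match["entity"],
        "attribute": match["attr"],
        "operator": match["op"],
        "value": float(match["value"]),
        "unit": match["unit"],
    }


def run_condition_query(parsed, entities):
    """Return names of entities of the parsed type that satisfy the condition."""
    compare = OPERATORS[parsed["operator"]]
    return [
        e["name"]
        for e in entities
        if e["type"] == parsed["entity_type"]
        and parsed["attribute"] in e["attributes"]
        and compare(e["attributes"][parsed["attribute"]], parsed["value"])
    ]


# Hypothetical entities with made-up attribute values.
ENTITIES = [
    {"name": "Mortar A", "type": "mortar", "attributes": {"firing rate": 250.0}},
    {"name": "Mortar B", "type": "mortar", "attributes": {"firing rate": 150.0}},
    {"name": "Tank C", "type": "tank", "attributes": {"firing rate": 300.0}},
]

parsed = parse_query("mortar with firing rate higher than 200 km/h")
matches = run_condition_query(parsed, ENTITIES)
print(parsed)
print(matches)
```

The point of the sketch is the shape of the intermediate result: once the sentence is reduced to (entity type, attribute, condition, value), the search itself is an ordinary filtered query over stored entity attributes.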
On the WEB search service page, keyword retrieval can be performed over the analyzed and processed public data, with keywords covering entities, relations, graph relations and multimedia; semantic retrieval can be performed, in which the query sentence is analyzed by the NLP tool and the query is run on the returned word segmentation result; and conditional queries and search-by-picture can be performed according to entity attributes and picture feature values. Finally, the data management platform also supports managing the model structure and updating the data content.
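Search-by-picture relies on comparing picture feature values, but the patent does not name the feature-value algorithm. The Python sketch below substitutes a simple average hash as one plausible choice and compares hashes by Hamming distance; the index layout, the function names and the tiny 2x2 pixel grids are assumptions made so the example stays self-contained (a real image would first be decoded and downscaled to a fixed grid).

```python
def average_hash(pixels):
    """Compute a bit-string feature value from a 2D grid of grayscale pixels.

    Each bit records whether a pixel is brighter than the grid's mean.
    This is a stand-in technique, not the algorithm used in the patent.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def search_by_picture(query_pixels, index, max_distance=1):
    """Find entities whose stored picture feature is close to the query.

    index: list of (feature_value, picture_id, entity_name), i.e. the stored
    feature values joined with the multimedia-to-entity mapping.
    Returns (distance, entity_name, picture_id) tuples, closest first.
    """
    query_hash = average_hash(query_pixels)
    hits = [
        (hamming(query_hash, feature), entity, picture_id)
        for feature, picture_id, entity in index
        if len(feature) == len(query_hash)
    ]
    return sorted(h for h in hits if h[0] <= max_distance)


# Illustrative 2x2 grayscale grids; picture ids and entity names are made up.
PICTURE_A = [[10, 200], [220, 30]]
PICTURE_B = [[200, 10], [20, 210]]
INDEX = [
    (average_hash(PICTURE_A), "pic-1", "Fighter-10"),
    (average_hash(PICTURE_B), "pic-2", "Tank X"),
]

# A slightly altered copy of PICTURE_A used as the query picture.
QUERY = [[12, 190], [230, 25]]
results = search_by_picture(QUERY, INDEX)
print(results)
```

Because each index row already carries the entity name, a match on the feature value resolves directly to an entity, which mirrors the reverse lookup from multimedia to entity that the mapping stored in MySQL is meant to support.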
In this specification, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it will be understood that these embodiments are illustrative and are not to be construed as limiting the invention; those skilled in the art may make changes, modifications, substitutions and variations to the above embodiments without departing from the spirit and principles of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. A novel search method based on public data, characterized by comprising the following steps:
step S1: according to entry information compiled in advance, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, parsing them to obtain an attribute list of the entry information and a multimedia list associated with the entities, and storing the attribute list and the multimedia list in a MySQL database;
step S2: performing data modeling on the entry data stored in the MySQL database with the Jena tool, so as to merge the attributes of entries of the same category, and deriving the relations among entries from the content of the merged attributes;
step S3: initializing the search service data, including initializing the data in the MySQL database, the ES database and the graph database;
in step S3, initializing the data of the MySQL database includes: initializing the mapping between entities and multimedia, which is used to look up an entity from a multimedia item;
in step S3, initializing the ES database includes: storing the specific entity and relation content in the ES search middleware, and storing the entity content from the MySQL database, the entry relation list compiled with Jena, and the multimedia information in the ES database for keyword retrieval;
in step S3, the feature values of pictures and the cover information of videos are added to the multimedia information, and the association between multimedia and entities is stored in MySQL;
in step S3, initializing the graph database includes: initializing the entity and relation data of the model, and using the graph database to render the graph on a WEB page;
specifically, the specific entity and relation content is stored in the graph database;
step S4: performing semantic analysis on the query sentence with an NLP tool, performing word segmentation on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, wherein the keywords cover entities, relations, graph relations and multimedia, and returning the retrieval result;
in step S4, the keyword retrieval includes: retrieving the entities, relations, graph relations and multimedia corresponding to the keywords; analyzing and segmenting the query sentence with the NLP tool and querying according to the returned word segmentation result; and performing conditional queries and search-by-picture according to entity attributes and picture feature values.
2. The method according to claim 1, wherein in step S1, the abstract and the attribute fields are parsed from the obtained content corresponding to the entry information and from the multimedia data associated with the entity, so as to obtain the attribute list corresponding to the entry.
3. The method according to claim 1, wherein in step S2, data modeling is performed with the Jena tool, each entry is treated as an entity, and model relations are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and hence whether a relation exists between the two entities, so as to obtain the relation types and the relation list among the entries.
CN202010169574.7A 2020-03-12 2020-03-12 Novel searching method based on public data Active CN111339391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010169574.7A CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010169574.7A CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Publications (2)

Publication Number Publication Date
CN111339391A (en) 2020-06-26
CN111339391B (en) 2024-03-19

Family

ID=71186022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010169574.7A Active CN111339391B (en) 2020-03-12 2020-03-12 Novel searching method based on public data

Country Status (1)

Country Link
CN (1) CN111339391B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413784A (en) * 2019-07-23 2019-11-05 国家计算机网络与信息安全管理中心 Public opinion association analysis method and system based on knowledge graph
CN110532404A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 Source multimedia determination method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN111339391A (en) 2020-06-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant