CN115168401A - Data grading processing method and device, electronic equipment and computer readable medium - Google Patents

Data grading processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN115168401A
Authority
CN
China
Prior art keywords
data
digital object
level
metadata
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210845080.5A
Other languages
Chinese (zh)
Inventor
张显达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Big Data Advanced Technology Research Institute
Original Assignee
Beijing Big Data Advanced Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Big Data Advanced Technology Research Institute
Priority to CN202210845080.5A
Publication of CN115168401A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of computer information processing, and in particular to a data grading processing method and apparatus, an electronic device, and a computer-readable medium. The method comprises: obtaining data to be processed; grading the data to be processed to obtain graded data; registering and storing the graded data level by level based on the digital object architecture; and obtaining specific data of a digital object according to a search keyword. The method achieves the technical effect that the document entity corresponding to an identifier can be resolved in any application system supporting DOI. Because the data is graded and the graded data is stored level by level, the full-text content can be found through the content of data at any level, which enables a flexible search mode.

Description

Data grading processing method and device, electronic equipment and computer readable medium
Technical Field
The present application relates to the field of computer information processing, and in particular, to a data hierarchical processing method and apparatus, an electronic device, and a computer-readable medium.
Background
The convergence of the Internet with other networks (including telecommunication networks, mobile networks, the Internet of Things, and the like) further promotes the fusion of human society, information space, and the physical world, forms a new human-machine-thing fusion computing environment, and gives rise to human-machine-thing fusion applications such as smart homes, smart cities, and intelligent manufacturing. Human-machine-thing fusion marks a shift from the interconnection of terminals, users, and applications toward the interconnection of everything, with information technology and its applications becoming ubiquitous. Big data has emerged along with this fusion and has opened the third wave of informatization (informatization 3.0), an intelligent stage characterized mainly by the deep mining and integrated application of data. In the new worldview of "digitalization of everything", data has become a fundamental strategic resource and a key factor of production.
One common requirement of human-machine-thing fusion is data interconnection and intelligent application. For example, a smart city needs to gather government data, social data, and even personal data in order to understand and predict the running state of the city and thereby achieve intelligent decision-making and efficient management; intelligent manufacturing needs to connect the data of production equipment upstream and downstream in the industrial chain, of each department of an enterprise, and of each factory in order to achieve accurate monitoring and control of the whole production and manufacturing process.
The Internet solves the problem of data transmission among machines and provides a basic technical means for data interconnection. However, the Internet cannot avoid or solve problems such as information silos, loss of control over data, and confirmation of data rights, which arise from the continuous development of information technology and its applications. The fundamental reason is that the core of the Internet, the TCP/IP protocol, is concerned only with the encoding, decoding, and transmission control of data at the binary bit level: data is exchanged efficiently among different computing devices in units of packets, and a consistent binary packet is obtained on any device in the network, while the identification, encoding, and parsing of the data all depend on upper-layer applications. Therefore, to achieve data interconnection on top of Internet data transmission, application systems must coordinate among themselves and agree on data syntax, semantics, pragmatics, and so on. This approach faces challenges such as high coordination cost, difficulty in guaranteeing responsibility and effect, low efficiency, susceptibility to errors, and difficulty of replication.
To achieve trusted, manageable, and controllable data interconnection and intelligent application on an Internet that is itself neither trusted nor controllable, academia and industry have proposed various solutions. For example, a data middle platform gathers data and provides upper-layer applications with a uniform data access service through standard APIs, which reduces the complexity of developing data applications and thereby addresses the information silo problem in data interconnection. A federated learning platform trains models at the source where the data is generated and uses an encryption mechanism to exchange and aggregate intermediate computation results, so that a model can still be trained jointly without the data leaving its local site, which addresses the problem of losing control over data in data interconnection.
An examination of the existing solutions readily shows that they all achieve trusted, manageable, and controllable data interconnection by building a software platform on top of the Internet. Although the data of all parties inside a platform can be exchanged in a trusted manner, data exchange between platforms remains difficult to understand, difficult to access, and difficult to control. The situation closely resembles the early development of the Internet, when each organization built its internal network with a proprietary protocol: computers within one network could communicate reliably, but the networks could hardly communicate with one another. The key technology of the Internet, the TCP/IP protocol, was developed jointly by Vinton Cerf and Robert Kahn in 1973; its core idea is to solve the difficult problem of interconnecting heterogeneous networks through an open architecture and a standardized protocol.
Therefore, to address the problems of existing platform-based data interconnection, a feasible idea is to borrow the design concept of the Internet and adopt the idea of software definition: heterogeneous data platforms and systems are connected through an open software architecture and standardized, data-centric interoperation protocols, forming a virtual data network on top of the physical machine Internet. This network may be called the Internet of Data, or data networking for short, and it enables data interconnection and intelligent application across the whole network.
In fact, early exploration of the concept of data networking has already been carried out at home and abroad. The most representative example is the Digital Object Architecture (DOA) proposed by Robert Kahn. Following the success in the 1970s of the TCP/IP protocol for machine interconnection, and in order to establish on the Internet a digital object architecture jointly governed by countries around the world, the Digital Object Interface Protocol (DOIP) and the Digital Object Identifier Resolution Protocol (DO-IRP) for data interconnection were formally proposed at the beginning of the 21st century, and the DONA Foundation, an open technical-standards organization, was established in 2014 to promote DOA. To date, DOA has published two standard recommendations in the ITU Telecommunication Standardization Sector (ITU-T), and a global digital object identifier resolution system and a number of application systems have been established, making DOA the most influential technology and standard system in the emerging field of data networking.
DOA has achieved large-scale global application in the field of digital libraries, namely the DOI system. Digital resources such as books, papers, reports, and videos are constructed as digital objects and assigned unique and persistent DOI identifiers. The document entity corresponding to an identifier can be resolved in any application system supporting DOI, which avoids the problem, common with Uniform Resource Locators (URLs), of resources becoming inaccessible when their location changes. As of May 2021, about 257 million digital objects had been registered with DOI worldwide, covering numerous academic databases at home and abroad, including IEEE, ACM, Springer, and Wanfang, among others.
Searching in the DOI system achieves the technical effect of finding specific data according to metadata. However, the current search mode is limited to a single form, and flexible searching cannot be achieved.
Disclosure of Invention
In view of the above, the present application provides a data classification processing method and apparatus, an electronic device, and a computer readable medium, which avoid the problem that resources are not accessible due to a change in a resource location, which is common in a Uniform Resource Locator (URL), and achieve a technical effect that a document entity corresponding to an identifier can be resolved in any application system supporting DOI.
According to an aspect of the present application, a data hierarchical processing method is provided, the method including: acquiring data to be processed; grading the data to be processed to obtain graded data; determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and assigning a data repository; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object; determining a search keyword; matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the tags which are successfully matched, and acquiring unique identifiers of the digital objects; querying a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; specific data of the digital object is obtained from the queried data warehouse.
Further, the metadata further comprises hierarchy data, the hierarchy data comprises the unique identifiers of the digital objects at the previous level and/or the next level, and the unique identifiers of digital objects at different levels are distinguished in a preset manner.
Further, the method further comprises: determining the hierarchy data in the metadata of the digital object corresponding to the successfully matched tag; determining the digital object of the previous level, level by level, according to the hierarchy data until the digital object of the highest level is determined; determining all digital objects of the next level, level by level, starting from the digital object of the highest level until all digital objects of the lowest level are determined; querying, from the registry, the data warehouse storing all digital objects of the lowest level; and querying the data warehouse for the specific data of all digital objects of the lowest level.
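The upward-then-downward traversal described in this paragraph can be sketched as follows. This is only a minimal illustration: the `parent` and `children` field names used for the hierarchy data, the identifiers, and the in-memory dictionaries standing in for the registry and the data warehouse are all assumptions made for the example and are not taken from the application.

```python
# Hypothetical in-memory registry: unique identifier -> hierarchy data.
# The "parent"/"children" field names are assumptions for illustration.
REGISTRY = {
    "art": {"parent": None, "children": ["art/p1", "art/p2"]},
    "art/p1": {"parent": "art", "children": ["art/p1/s1"]},
    "art/p2": {"parent": "art", "children": ["art/p2/s1", "art/p2/s2"]},
    "art/p1/s1": {"parent": "art/p1", "children": []},
    "art/p2/s1": {"parent": "art/p2", "children": []},
    "art/p2/s2": {"parent": "art/p2", "children": []},
}

# Hypothetical data warehouse: unique identifier -> specific data.
WAREHOUSE = {
    "art/p1/s1": "First sentence.",
    "art/p2/s1": "Second sentence.",
    "art/p2/s2": "Third sentence.",
}


def find_top(doid):
    """Walk up level by level until the highest-level digital object is reached."""
    while REGISTRY[doid]["parent"] is not None:
        doid = REGISTRY[doid]["parent"]
    return doid


def collect_leaves(doid):
    """Walk down level by level and return all lowest-level identifiers in order."""
    children = REGISTRY[doid]["children"]
    if not children:
        return [doid]
    leaves = []
    for child in children:
        leaves.extend(collect_leaves(child))
    return leaves


def full_text_from(matched_doid):
    """Recover the full text starting from any matched digital object."""
    top = find_top(matched_doid)
    return [WAREHOUSE[leaf] for leaf in collect_leaves(top)]
```

In this sketch, calling `full_text_from` with the identifier of any paragraph or sentence returns every sentence of the article, which is the behaviour that lets the full text be found from data at any level.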
Further, the data searching step further comprises: determining the hierarchical data in the metadata of the digital object corresponding to the successfully matched tag; determining a digital object of a highest hierarchical level from the hierarchical data; querying a data repository storing the digital object at the highest hierarchical level from the registry; specific data of the digital object of the highest hierarchical level is queried from the data warehouse.
Further, the method further comprises: determining the digital object of the previous level step by step according to the level data in the metadata until determining the digital object of the highest level; and adding the newly added hierarchy data to the metadata of the digital object at the highest hierarchy level.
According to another aspect of the present application, there is provided a data hierarchical processing apparatus, the apparatus including: the system comprises a data acquisition unit, a data grading unit, a grading registration and storage unit based on a digital object system and a search unit, wherein the data acquisition unit is used for: acquiring data to be processed; the data classification unit is configured to: grading the data to be processed to obtain graded data; the hierarchical registration and storage unit based on the digital object hierarchy is used for: determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identification of the digital object in the registration system and assigning a data repository; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, the tag of the digital object is used for matching with a search keyword when a user searches the digital object, and the search unit is used for: determining a search keyword; matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the tags which are successfully matched, and acquiring unique identifiers of the digital objects; querying a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; specific data of the digital object is obtained from the queried data warehouse.
Further, the metadata further comprises hierarchy data, the hierarchy data comprises the unique identifiers of the digital objects at the previous level and/or the next level, and the unique identifiers of digital objects at different levels are distinguished in a preset manner.
Further, the apparatus further comprises an adding unit configured to: determining the digital object of the previous level step by step according to the level data in the metadata until determining the digital object of the highest level; and adding the newly added hierarchy data to the metadata of the digital object at the highest hierarchy level.
According to yet another aspect of the present application, an electronic device is provided, the electronic device comprising: one or more processors; and storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to yet another aspect of the application, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
The data grading processing method provided by the application comprises the following steps: acquiring data to be processed; grading the data to be processed to obtain graded data; determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and assigning a data repository; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object; determining a search keyword; matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the successfully matched tags and acquiring unique identifiers of the digital objects; querying a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; specific data of the digital object is obtained from the queried data warehouse. By employing a DOA system/architecture, each digital object is assigned a globally unique identification. The identity is a core attribute of the digital object that does not change with changes in the owner, storage location, access mode of the digital object. 
The document entity corresponding to the identifier can be resolved in any application system supporting DOI, which avoids the problem, common with Uniform Resource Locators (URLs), of resources becoming inaccessible when their location changes. Because the data is graded and the graded data is stored level by level, the full-text content can be found through the content of data at any level, which enables a flexible search mode.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the present application, and other drawings may be derived from those drawings by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a data grading processing method in accordance with an exemplary embodiment.
FIG. 2 is a diagram illustrating a correlation analysis result according to an example embodiment.
FIG. 3 is a block diagram illustrating a data grading processing apparatus according to another exemplary embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 5 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present application and are, therefore, not intended to limit the scope of the present application.
In the present application, "metadata" and "meta information" represent the same meaning; "data to be processed" and "information to be processed" represent the same meaning; "hierarchy data" and "hierarchy information" represent the same meaning. "keyword" and "keyword" represent the same meaning.
Fig. 1 is a flowchart of a data grading processing method according to the present application. The method is based on the digital object architecture.
The Digital Object Architecture (DOA) comprises a registry (Registry), a registration system (Handle System), and a data warehouse (Repository).
The Registry stores the metadata of a DO (Digital Object); the Handle System handles registration of the DO and records which Repository the DO is stored in; and the Repository stores the DO itself.
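The division of labour among the three components can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class name, method names, and dictionary layout below are hypothetical and do not correspond to any real DOA, DOIP, or Handle System API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDigitalObjectArchitecture:
    """Illustrative stand-in for the three DOA components (names are hypothetical)."""
    registry: dict = field(default_factory=dict)       # DO identifier -> metadata
    handle_system: dict = field(default_factory=dict)  # DO identifier -> repository name
    repositories: dict = field(default_factory=dict)   # repository name -> {identifier: data}

    def register(self, doid, metadata, data, repo_name):
        """Store metadata, record the assigned repository, then store the data."""
        self.registry[doid] = metadata
        self.handle_system[doid] = repo_name
        self.repositories.setdefault(repo_name, {})[doid] = data

    def resolve(self, doid):
        """Ask the handle system where the DO lives, then fetch it from there."""
        repo_name = self.handle_system[doid]
        return self.repositories[repo_name][doid]
```

Because callers only ever hold the identifier, the data could be moved to another repository by updating the handle system entry, without changing the identifier itself.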
As shown in Fig. 1, the method comprises the following steps: data acquisition, data grading, graded registration and storage based on the digital object architecture, and searching.
Step S101: acquiring data to be processed.
The data to be processed may be data that has been collected and classified, or data that has only been collected. Collection may be automated or may involve human participation, for example by distributed web crawlers, which are well known in the art, or by relevant experts. During collection, the organization or author to which the information belongs, the generation time, and the title need to be recorded, and, where possible, the target location related to the information should also be collected.
As an alternative, when an automatic crawler is used, the information sources of primary interest, such as the websites of various news media, are determined first. After the information sources are determined, the crawling scope needs to be determined, for example whether to follow external links in a target website, how many levels to crawl, and whether to crawl files or pictures.
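The crawling scope described above can be captured as a small configuration object. The following is a hedged sketch only: the `CrawlScope` class, its field names, and the example URL are assumptions introduced for illustration, and the application does not prescribe any particular crawler design.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlScope:
    """Hypothetical crawl-scope settings for one information source."""
    seed_url: str
    follow_external_links: bool = False  # whether to track links leaving the site
    max_depth: int = 2                   # number of crawling levels
    fetch_files: bool = False            # consulted by a downloader, not by allows()
    fetch_pictures: bool = False         # consulted by a downloader, not by allows()

    def allows(self, url: str, depth: int) -> bool:
        """Return True if a link found at the given depth is within scope."""
        if depth > self.max_depth:
            return False
        same_site = urlparse(url).netloc == urlparse(self.seed_url).netloc
        return same_site or self.follow_external_links
```

For example, a scope created with `CrawlScope("https://news.example.org/")` would accept same-site links up to two levels deep and reject links to other hosts.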
Step S102: grading the data to be processed to obtain graded data.
The step of grading the data to be processed may also be referred to as a structured parsing step. The data to be processed is classified according to its type, and after the classification is completed, the data to be processed is parsed into a plurality of minimum information units (also referred to simply as minimum units in this application).
The composition of the information (data to be processed) may be in the form of a file, for example, one report contains a plurality of files, in which case the report may be regarded as superior information, each file being a sub-information it contains. In addition to a file, a web page may contain multiple links, and each link may be considered a unit of sub-information.
For the text information of a website, the data can be arranged (graded) into a tree structure from top to bottom. The top level is website information, the second level is web page information, a web page contains different articles, and the sentences in the articles serve as the minimum units of the digital object.
As an alternative embodiment, the structural parsing (grading) is performed by paragraph and by sentence. Paragraph parsing is simple: the text of the article only needs to be split at line breaks (carriage returns). Sentences, however, cannot simply be split at periods in the way paragraphs are split at line breaks, because some abbreviations, such as U.S.A., also contain periods; sentence segmentation therefore needs the assistance of an abbreviation corpus.
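The paragraph and sentence segmentation just described can be sketched as follows. This is a minimal illustration, assuming whitespace-separated tokens and a tiny hand-written abbreviation set standing in for the abbreviation corpus; the function names are hypothetical.

```python
import re

# Tiny illustrative abbreviation corpus; a real one would be far larger.
ABBREVIATIONS = {"u.s.a.", "e.g.", "i.e.", "dr.", "mr."}


def split_paragraphs(text):
    """Split the text at line breaks and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]


def split_sentences(paragraph):
    """End a sentence after a token ending in '.', '!' or '?',
    unless that token is a known abbreviation."""
    sentences, current = [], []
    for token in paragraph.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences
```

One limitation of this simple rule is that an abbreviation that happens to end a sentence does not trigger a split, so a production segmenter would need additional cues such as capitalization of the following word.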
In addition to segmenting the paragraphs and sentences of an article, the grading step includes extracting attributes from the article, such as its author and the organization it belongs to. With the help of a knowledge base for the professional field, entities in a sentence, such as places and organizations, can also be labeled.
Step S103: graded registration and storage based on the digital object architecture.
Step S103 specifically includes: determining digital objects of all levels according to the classified data, wherein the digital objects comprise metadata and specific data, and the metadata comprises the classification information of the digital objects, so that the digital objects are convenient to retrieve in the future; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and allocating a data warehouse; and storing specific data of the digital object into a data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object.
Step S103 is a process of reorganizing and storing the graded data obtained in step S102 in accordance with the digital object architecture.
The specific data is information reflecting the content of the data to be processed.
The hierarchical information of the data is mainly embodied by meta-information in the digital object.
The meta information (metadata) is mainly used to describe the digital object so that users can search for it conveniently. The composition of the metadata is flexible, but generally it contains the following information: the ID number of the digital object (also referred to herein as the DOID or the unique identifier of the digital object), the type of the digital object, the length of the digital object, the format of the digital object, and a number of keywords of the digital object (also referred to herein as tags).
In a broad sense, any auxiliary information of the digital object content itself may be referred to as metadata, such as the time of recording of the article, the person recording, confirmation of the trustworthiness of the article by the person recording, the validity period of the information, and so on.
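The metadata fields listed above can be represented as a simple record. The following is a sketch under stated assumptions: the class name, field names, the `text/plain` default format, and the `describe` helper are hypothetical, and the application leaves the exact composition of the metadata open.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Illustrative metadata record for one digital object."""
    doid: str          # unique identifier of the digital object (DOID)
    object_type: str   # e.g. "article", "paragraph", "sentence"
    length: int        # length of the specific data
    data_format: str   # format of the specific data
    tags: list = field(default_factory=list)   # keywords matched against searches
    extra: dict = field(default_factory=dict)  # any other auxiliary information


def describe(doid, object_type, content, tags, **extra):
    """Build a metadata record for a piece of text content."""
    return Metadata(
        doid=doid,
        object_type=object_type,
        length=len(content),
        data_format="text/plain",  # assumed default for this sketch
        tags=list(tags),
        extra=extra,
    )
```

Auxiliary information in the broad sense, such as the recording time or the person recording, would go into the `extra` dictionary in this sketch, for example `describe("do-42", "sentence", "Data is a resource.", ["data"], recorded_by="editor")`.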
Step S104: acquiring specific data of the digital object according to the search keyword.
Step S104 specifically includes: determining a search keyword; matching the search keyword with the tags in the metadata, screening out the digital object corresponding to the successfully matched tag, and acquiring the unique identifier of the digital object; inquiring a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; specific data of the digital object is obtained from the queried data warehouse.
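The search flow of step S104 can be sketched as follows. This is a minimal illustration: the sample identifiers, tags, and warehouse names are invented, the dictionaries merely stand in for the registry, registration system, and data warehouses, and the matching rule (case-insensitive equality with a tag) is an assumption, since the application does not fix one.

```python
# Toy stand-ins for the registry, registration system and data warehouses.
# All identifiers, tags and warehouse names are invented for illustration.
REGISTRY = {
    "do-1": {"tags": ["smart city", "data"]},
    "do-2": {"tags": ["manufacturing"]},
}
HANDLE_SYSTEM = {"do-1": "warehouse-a", "do-2": "warehouse-b"}
WAREHOUSES = {
    "warehouse-a": {"do-1": "A smart city gathers government data."},
    "warehouse-b": {"do-2": "Intelligent manufacturing connects factories."},
}


def search(keyword):
    """Return {unique identifier: specific data} for objects whose tags match."""
    wanted = keyword.strip().lower()
    results = {}
    for doid, metadata in REGISTRY.items():
        # Assumed matching rule: case-insensitive equality with any tag.
        if any(wanted == tag.lower() for tag in metadata["tags"]):
            warehouse = HANDLE_SYSTEM[doid]              # look up the assigned warehouse
            results[doid] = WAREHOUSES[warehouse][doid]  # fetch the specific data
    return results
```

The lookup deliberately goes through the registration system rather than reading a storage location from the metadata, mirroring the order of operations in the step: match tags, obtain the identifier, query the assigned warehouse, then fetch the data.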
The data grading processing method provided by the application comprises the following steps: acquiring data to be processed; grading the data to be processed to obtain graded data; determining a digital object of each level according to the graded data, wherein the digital object comprises metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and assigning a data warehouse; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object; determining a search keyword; matching the search keyword with the tags in the metadata, screening out the digital objects corresponding to the successfully matched tags, and acquiring the unique identifiers of those digital objects; querying, from the registration system and according to the unique identifier, the data warehouse storing the specific data of the digital object; and obtaining the specific data of the digital object from the queried data warehouse. By employing the DOA architecture, each digital object is assigned a globally unique identifier. The identifier is a core attribute of the digital object and does not change when the owner, storage location, or access mode of the digital object changes.
The document entity corresponding to the identifier can be resolved in any application system supporting DOI, which avoids the problem, common with Uniform Resource Locators (URLs), of resources becoming inaccessible when their location changes. Because the data is graded and the graded data is stored level by level, the full-text content can be found through the content of data at any level, which enables a flexible search mode.
Next, the method of the present application will be described in detail.
For example, the text of an article contains 6 natural paragraphs, which contain 1, 2, 3, 4, 2, and 3 sentences respectively. This article is the data to be processed (which may also be referred to as information to be processed). As an alternative embodiment, the sentence is taken as the minimum information unit.
A sentence is an information unit (the smallest one), a natural paragraph is also an information unit, and an article is also an information unit. A hierarchical relationship exists between these information units. Sentences are at the lowest level, paragraphs are at the middle level, and articles are at the highest level. A sentence is a sub-information unit of the paragraph to which it belongs, and a paragraph is a sub-information unit of the article to which it belongs.
The information is nested and stored by nesting the IDs of the information units of different hierarchies.
Natural paragraph 1 includes 1 sentence, which is the 1 st sentence of the full text.
Natural paragraph 2 includes 2 sentences.
The first sentence in [ natural paragraph 2 ] is the 2 nd sentence of the full text, which is hereinafter simply referred to as sentence 2.
The second sentence of [ natural paragraph 2 ] is the 3 rd sentence of the full text, which is hereinafter simply referred to as sentence 3.
Natural paragraph 3 includes 3 sentences.
The first sentence in [ natural paragraph 3 ] is the 4 th sentence in the full text, which is hereinafter simply referred to as sentence 4.
The second sentence of [ natural paragraph 3 ] is the 5 th sentence of the full text, which is hereinafter simply referred to as sentence 5.
The third sentence of [ natural paragraph 3 ] is the 6 th sentence of the full text, which is hereinafter simply referred to as sentence 6.
Natural paragraph 4 includes 4 sentences.
The first sentence of [ natural paragraph 4 ] is the 7 th sentence of the full text, which is hereinafter simply referred to as sentence 7.
The second sentence of [ natural paragraph 4 ] is the 8 th sentence of the full text, which is hereinafter simply referred to as sentence 8.
The third sentence of [ natural paragraph 4 ] is the 9 th sentence of the full text, which is hereinafter simply referred to as sentence 9.
The fourth sentence of [ natural paragraph 4 ] is the 10 th sentence of the full text, which is hereinafter simply referred to as sentence 10.
Natural paragraph 5 includes 2 sentences.
The first sentence in [ natural paragraph 5 ] is the 11 th sentence of the full text, which is hereinafter simply referred to as sentence 11.
The second sentence of [ natural paragraph 5 ] is the 12 th sentence of the full text, which is hereinafter simply referred to as sentence 12.
Natural paragraph 6 includes 3 sentences.
The first sentence of [ natural paragraph 6 ] is the 13 th sentence of the full text, which is hereinafter simply referred to as sentence 13.
The second sentence of [ natural paragraph 6 ] is the 14 th sentence of the full text, which is hereinafter simply referred to as sentence 14.
The third sentence of [ natural paragraph 6 ] is the 15 th sentence of the full text, which is hereinafter simply referred to as sentence 15.
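The numbering above follows mechanically from the sentence counts per paragraph. As a minimal sketch (the function name `sentence_ranges` is illustrative and does not appear in the application), the full-text sentence numbers of each natural paragraph can be derived as follows:

```python
from itertools import accumulate

# Sentence counts of the 6 natural paragraphs in the example article.
sentences_per_paragraph = [1, 2, 3, 4, 2, 3]


def sentence_ranges(counts):
    """Return, for each paragraph, the 1-based full-text numbers of its sentences."""
    ends = list(accumulate(counts))  # number of the last sentence in each paragraph
    return [list(range(end - n + 1, end + 1)) for end, n in zip(ends, counts)]


ranges = sentence_ranges(sentences_per_paragraph)
for number, sentences in enumerate(ranges, start=1):
    print(f"natural paragraph {number}: sentences {sentences}")
```

The running total also confirms that the article contains 15 sentences in all.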
Metadata of data of different levels includes different aspects, for example, at the level of an article, an organization publishing the article, publication time of the article, media publishing the article, an article author, keywords of the article, ID of the article, and the like can be used as metadata; at the level of a paragraph, keywords of the paragraph, an ID of the paragraph, and the like may be used as metadata; at the sentence level, keywords of sentences, IDs of sentences, and the like may be used as metadata.
The inventors of the present application have found that if a sentence is stored in the data warehouse as the smallest unit of information, the information that can be retrieved is at the level of the sentence. For example, the article includes 15 sentences; the metadata of the 15 sentences is stored in the registry (as shown in Table 1), the mapping relationship between the ID in the metadata and the data warehouse is stored in the registration system (as shown in Table 2), and the specific data of the 15 sentences is stored in the data warehouse (as shown in Table 3). Then, when a search is performed with an appropriate search keyword, the matching sentences can be retrieved.
TABLE 1: Metadata (table provided as an image in the original: Figure RE-GDA0003803756620000121)
TABLE 2 (table provided as images in the original: Figures RE-GDA0003803756620000122 and RE-GDA0003803756620000131)
TABLE 3 (table provided as an image in the original: Figure RE-GDA0003803756620000132)
Suppose the search keyword is "keyword two". Matching "keyword two" against the tags in Table 1 yields two successfully matched IDs: 100001 and 100015.
In Table 2, the data warehouse corresponding to ID100001 is found to be numbered C0000051, and the data warehouse corresponding to ID100015 is numbered C0000065.
Looking up the specific data stored in data warehouse C0000051 in Table 3 yields the content of sentence 1. Looking up the specific data stored in data warehouse C0000065 in Table 3 yields the content of sentence 15.
In this way, the two sentences in the article whose tags contain "keyword two" are retrieved.
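The three-table lookup just described can be sketched in Python. This is only an illustration under stated assumptions: the dictionaries stand in for Tables 1 to 3, whose full contents are not reproduced in the text; the IDs 100001 and 100015 and the warehouse numbers C0000051 and C0000065 come from the description, while the entry for 100002, its tag, its warehouse number, and all sentence contents are hypothetical placeholders.

```python
# Table 1 stand-in: registry holding metadata (ID -> tags).
REGISTRY = {
    "100001": ["keyword two"],
    "100002": ["keyword five"],   # hypothetical entry and tag
    "100015": ["keyword two"],
}

# Table 2 stand-in: registration system (ID -> data warehouse number).
REGISTRATION = {
    "100001": "C0000051",
    "100002": "C0000052",         # hypothetical warehouse number
    "100015": "C0000065",
}

# Table 3 stand-in: data warehouses (number -> specific data, placeholders).
WAREHOUSES = {
    "C0000051": "content of sentence 1",
    "C0000052": "content of sentence 2",
    "C0000065": "content of sentence 15",
}


def search(keyword):
    """Match the keyword against tags, then resolve each hit to its specific data."""
    matched_ids = [oid for oid, tags in REGISTRY.items() if keyword in tags]
    return [(oid, REGISTRATION[oid], WAREHOUSES[REGISTRATION[oid]])
            for oid in matched_ids]


results = search("keyword two")
for object_id, warehouse_no, content in results:
    print(object_id, warehouse_no, content)
```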
However, the inventor found that it is not enough to know only two separate sentences to perform data analysis based on the search results, and it is also necessary to know the content of the article in which the two sentences are located, and information such as the author, organization, source, and publication time of the article.
The inventor considered one possible improvement: storing information such as the author, organization, source, and publication time of the article in the registry as metadata of each sentence, so that once a sentence is retrieved with the search keyword, this information can be read from the metadata of the sentence. However, two problems remain: 1. The specific content of the article containing the sentence is still unknown. 2. The meta information of each sentence becomes large and complex, and a great deal of redundant information occupies storage space. For example, the meta information of each of the 15 sentences would include the author, organization, source, and publication time of the article. This information is identical for all 15 sentences, so it occupies storage resources redundantly. And this is only a short article: if an article has 100 sentences and the meta information of each sentence contains the author, organization, source, and publication time, this information is stored in the registry 100 times.
The inventor further conceived the following: an ID is also assigned to the paragraph containing a sentence, and the paragraph ID is stored as meta information of the sentence; likewise, an ID is assigned to the article containing the paragraph, and the article ID is stored as meta information of the paragraph. After a sentence is retrieved with the search keyword, the ID of the paragraph containing it can be read from the meta information of the sentence, and the paragraph can be located through that ID. The ID of the article containing the paragraph is then obtained from the meta information of the paragraph, so the article containing the sentence can be located through the article ID. The metadata of the article includes information such as the author, organization, source, and publication time of the article. This information therefore only needs to be stored once, as meta information of the article, rather than as meta information of every sentence, which greatly reduces the amount of stored data.
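To make the saving concrete, the following sketch counts how many metadata values must be stored under each scheme. The four-field list and the counting model (one stored value per article-level field, and one per upper-level ID) are simplifying assumptions made for illustration; they are not figures from the application.

```python
# Article-level fields named in the description (treated as one value each).
ARTICLE_FIELDS = ["author", "organization", "source", "publication_time"]


def flat_values(num_sentences):
    """Every sentence carries its own copy of all article-level fields."""
    return num_sentences * len(ARTICLE_FIELDS)


def hierarchical_values(num_sentences, num_paragraphs):
    """Article fields are stored once; each sentence and paragraph stores one upper-level ID."""
    upper_level_ids = num_sentences + num_paragraphs
    return len(ARTICLE_FIELDS) + upper_level_ids


print(flat_values(15), hierarchical_values(15, 6))
print(flat_values(100))
```

Under this model the flat scheme grows with the number of fields times the number of sentences, whereas the hierarchical scheme adds only one ID per sentence or paragraph.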
The inventor further thinks that if the meta information of the article includes the ID of each paragraph included in the article and the meta information of the paragraph includes the ID of each sentence included in the paragraph, the ID of each paragraph can be known from the meta information of the article, so as to locate each paragraph, and the ID of the sentence included in each paragraph can be known from the meta information of each paragraph, so as to locate each sentence included in each paragraph. When any sentence in a certain article is searched, the whole content of the article can be known.
That is, the article, the paragraph, and the sentence are at three different levels, and the metadata is selected differently at each level. The metadata at the sentence level includes the ID of the sentence and the ID of the paragraph to which the sentence belongs. The metadata at the paragraph level includes the ID of the paragraph, the ID of the article to which the paragraph belongs, and the IDs of all sentences contained in the paragraph. The metadata at the article level includes the ID of the article and the IDs of all paragraphs contained in the article.
The IDs of data at different levels may be distinguished using prefixes, suffixes, and the like. For example, in Table 4, an ID whose first digit is 1 is the ID of a sentence, an ID whose first digit is 2 is the ID of a paragraph, and an ID whose first digit is 3 is the ID of an article. Thus the 15 IDs "100001", "100002", ..., "100015" begin with 1 and are sentence IDs; the 6 IDs "200001", "200002", ..., "200006" begin with 2 and are paragraph IDs; and "300001" begins with 3 and is the ID of the article.
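The prefix convention can be expressed as a small lookup. This is a minimal sketch; the names `LEVEL_BY_FIRST_DIGIT` and `level_of` are illustrative and are not defined in the application.

```python
# First digit of the ID -> level, following the convention used in Table 4.
LEVEL_BY_FIRST_DIGIT = {"1": "sentence", "2": "paragraph", "3": "article"}


def level_of(object_id):
    """Return the level encoded by the first digit of an ID."""
    return LEVEL_BY_FIRST_DIGIT[object_id[0]]


for object_id in ["100001", "100015", "200006", "300001"]:
    print(object_id, level_of(object_id))
```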
TABLE 4: Metadata (table provided as images in the original: Figures RE-GDA0003803756620000151, RE-GDA0003803756620000161, and RE-GDA0003803756620000171)
The search keyword is "keyword two", the keyword "keyword two" is matched with the tags in table 4, and the successfully matched IDs are 100001, 100015, 200001, 200006 and 300001.
As described above, in Table 4, the first digit is 1 is the ID of a sentence, the first digit is 2 is the ID of a paragraph, and the first digit is 3 is the ID of an article. Therefore, it can be seen that ID100001 and ID100015 are the IDs of sentences, ID200001 and ID200006 are the IDs of paragraphs, and ID300001 is the ID of an article.
From any of these 5 IDs, the entire content of the article can be known.
The method for querying the whole content of the article according to ID100001 is as follows:
In Table 4, the ID of the upper-level data of ID100001 is found to be 200001. The ID of the upper-level data of ID200001 is then looked up in Table 4 and found to be 300001. The metadata of ID300001 does not include the ID of any upper-level data, so ID300001 is the highest-level data. Next, the IDs of the lower-level data of ID300001 are looked up in Table 4: 200001, 200002, 200003, 200004, 200005, and 200006. The IDs of the lower-level data of each of these paragraphs are then looked up in Table 4: for ID200001, 100001; for ID200002, 100002 and 100003; for ID200003, 100004, 100005, and 100006; for ID200004, 100007, 100008, 100009, and 100010; for ID200005, 100011 and 100012; and for ID200006, 100013, 100014, and 100015. The metadata of ID100001 through ID100015 does not include the ID of any lower-level data, which indicates that these are the lowest-level data. The IDs of all sentences of the article (the IDs of all lowest-level data) are thus known. For each of the 15 sentence IDs, the number of the corresponding data warehouse is looked up in Table 2, and the specific data stored in that data warehouse is then looked up in Table 3. In this way, the article containing a sentence can be restored from that one sentence. In addition, information such as the article author can be obtained from the metadata of ID300001.
The method for querying the entire content of the article according to ID100015 is similar to the method for querying the entire content of the article according to ID100001, and is not described again.
The method for querying the entire content of the article according to ID200001 is as follows:
In Table 4, the ID of the upper-level data of ID200001 is looked up and found to be 300001. The metadata of ID300001 does not include the ID of any upper-level data, so ID300001 is the highest-level data. Next, the IDs of the lower-level data of ID300001 are looked up in Table 4: 200001, 200002, 200003, 200004, 200005, and 200006. The IDs of the lower-level data of each of these paragraphs are then looked up in Table 4: for ID200001, 100001; for ID200002, 100002 and 100003; for ID200003, 100004, 100005, and 100006; for ID200004, 100007, 100008, 100009, and 100010; for ID200005, 100011 and 100012; and for ID200006, 100013, 100014, and 100015. The metadata of ID100001 through ID100015 does not include the ID of any lower-level data, which indicates that these are the lowest-level data. The IDs of all sentences of the article (the IDs of all lowest-level data) are thus known. For each of the 15 sentence IDs, the number of the corresponding data warehouse is looked up in Table 2, and the specific data stored in that data warehouse is then looked up in Table 3. In this way, the article can be restored. In addition, information such as the article author can be obtained from the metadata of ID300001.
The method for querying the whole content of the article according to the ID200006 is similar to the method for querying the whole content of the article according to the ID200001, and is not described again.
The method for querying the whole content of the article according to the ID300001 is as follows:
In Table 4, the ID of the upper-level data of ID300001 is looked up, and the metadata of ID300001 is found not to include the ID of any upper-level data, so ID300001 is the highest-level data. Next, the IDs of the lower-level data of ID300001 are looked up in Table 4: 200001, 200002, 200003, 200004, 200005, and 200006. The IDs of the lower-level data of each of these paragraphs are then looked up in Table 4: for ID200001, 100001; for ID200002, 100002 and 100003; for ID200003, 100004, 100005, and 100006; for ID200004, 100007, 100008, 100009, and 100010; for ID200005, 100011 and 100012; and for ID200006, 100013, 100014, and 100015. The metadata of ID100001 through ID100015 does not include the ID of any lower-level data, which indicates that these are the lowest-level data. The IDs of all sentences of the article (the IDs of all lowest-level data) are thus known. For each of the 15 sentence IDs, the number of the corresponding data warehouse is looked up in Table 2, and the specific data stored in that data warehouse is then looked up in Table 3. In this way, the article can be restored. In addition, information such as the article author can be obtained from the metadata of ID300001.
If the metadata contains the ID of the upper level data but not the ID of the lower level data, it is indicated that this digital object is the data of the lowest hierarchy level, and the digital object corresponds to the smallest information unit. If the metadata contains the ID of the next level data but does not contain the ID of the previous level data, it indicates that the digital object is the data of the highest hierarchy level. If the metadata contains the ID of the next level data and the ID of the previous level data, the digital object is the data of the middle level.
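The three rules above amount to a check on which link fields are present in the metadata. A minimal sketch follows; the field names `upper` and `lower` are assumptions chosen for illustration, and the final branch covers a case the description does not discuss.

```python
def classify(metadata):
    """Classify a digital object by which hierarchy links its metadata contains."""
    has_upper = metadata.get("upper") is not None
    has_lower = bool(metadata.get("lower"))
    if has_upper and not has_lower:
        return "lowest"      # smallest information unit, e.g. a sentence
    if has_lower and not has_upper:
        return "highest"     # e.g. the article
    if has_upper and has_lower:
        return "middle"      # e.g. a paragraph
    return "standalone"      # neither link present; not covered by the description


print(classify({"upper": "200001", "lower": []}))
print(classify({"upper": "300001", "lower": ["100001"]}))
print(classify({"upper": None, "lower": ["200001", "200002"]}))
```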
For any ID retrieved with the search keyword, the metadata of the digital object corresponding to that ID is looked up in the registry. Three situations may arise.
In the first case, the metadata of the ID contains the ID of upper-level data but not the ID of lower-level data, which indicates that the digital object is lowest-level data. The IDs of the upper-level data are then followed level by level until the ID of the highest-level data is found. Next, starting from the IDs of the lower-level data contained in the metadata of the highest-level data, the IDs of the lower-level data are followed level by level until all lowest-level data are found. The number of the data warehouse is queried in the registration system according to the ID of each lowest-level data item, and the specific data is queried in that data warehouse, so that the entire content can be restored.
In the second case, the metadata of the ID contains both the ID of upper-level data and the ID of lower-level data, which indicates that the digital object is intermediate-level data. The IDs of the upper-level data are then followed level by level until the ID of the highest-level data is found. Next, starting from the IDs of the lower-level data contained in the metadata of the highest-level data, the IDs of the lower-level data are followed level by level until all lowest-level data are found. The number of the data warehouse is queried in the registration system according to the ID of each lowest-level data item, and the specific data is queried in that data warehouse, so that the entire content can be restored.
In the third case, the metadata of the ID contains the ID of lower-level data but not the ID of upper-level data, which indicates that the digital object is highest-level data. Starting from the IDs of the lower-level data contained in its metadata, the IDs of the lower-level data are followed level by level until all lowest-level data are found. The number of the data warehouse is queried in the registration system according to the ID of each lowest-level data item, and the specific data is queried in that data warehouse, so that the entire content can be restored.
Thus, the technical effect of inquiring the full text of the article according to any ID of any hierarchy is achieved.
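The traversal common to the three cases (climb to the highest level, then descend to all lowest-level data and fetch their specific data) can be sketched as follows. This is a sketch under assumptions, not the application's implementation: the `upper`/`lower` field names are illustrative, the hierarchy is rebuilt from the IDs given in the description rather than copied from Table 4, the warehouse numbers extend the two values given in the text (C0000051 and C0000065) by an assumed consecutive pattern, and the sentence contents are placeholders.

```python
def build_example():
    """Rebuild the example hierarchy: 1 article, 6 paragraphs, 15 sentences."""
    counts = [1, 2, 3, 4, 2, 3]
    registry = {"300001": {"upper": None, "lower": []}}
    registration, warehouses = {}, {}
    sentence_no = 0
    for p, n in enumerate(counts, start=1):
        pid = f"2{p:05d}"
        registry["300001"]["lower"].append(pid)
        registry[pid] = {"upper": "300001", "lower": []}
        for _ in range(n):
            sentence_no += 1
            sid = f"1{sentence_no:05d}"
            registry[pid]["lower"].append(sid)
            registry[sid] = {"upper": pid, "lower": []}
            warehouse_no = f"C{50 + sentence_no:07d}"  # assumed numbering pattern
            registration[sid] = warehouse_no
            warehouses[warehouse_no] = f"sentence {sentence_no}"  # placeholder content
    return registry, registration, warehouses


def find_highest(registry, object_id):
    """Follow upper-level IDs until an object without one is reached."""
    while registry[object_id]["upper"] is not None:
        object_id = registry[object_id]["upper"]
    return object_id


def lowest_ids(registry, object_id):
    """Follow lower-level IDs, in order, down to the lowest-level objects."""
    lower = registry[object_id]["lower"]
    if not lower:
        return [object_id]
    found = []
    for child_id in lower:
        found.extend(lowest_ids(registry, child_id))
    return found


def restore(object_id, registry, registration, warehouses):
    """Return the highest-level ID and the specific data of all lowest-level objects."""
    top = find_highest(registry, object_id)
    return top, [warehouses[registration[i]] for i in lowest_ids(registry, top)]


registry, registration, warehouses = build_example()
top, sentences = restore("100015", registry, registration, warehouses)
print(top, len(sentences))
```

Starting from a sentence, a paragraph, or the article itself leads to the same highest-level ID and the same ordered list of sentences, which is the behaviour the three cases describe.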
It can be easily seen that this method has considerable advantages. 1. Information such as article authors and the like only needs to be stored once as meta-information of the articles, but does not need to be stored as meta-information of paragraphs or sentences, which greatly reduces the amount of data stored in the registry. 2. If any sentence in the article is searched according to the search keyword, all information of the article, including information such as article text and article author, can be known. 3. In the data warehouse are stored the smallest units of information (sentences), and not paragraphs or full articles. The paragraphs or article full texts are obtained through the combination of sentences. Namely, the sentence where the search keyword is located can be located, and the paragraph and the article where the search keyword is located can also be located. Repeated storage in three forms of sentences, paragraphs and articles in the data warehouse is not needed, so that the specific data quantity needing to be stored in the data warehouse is reduced to the maximum extent.
The inventors have found that, as an alternative embodiment, the metadata may contain the ID of the upper-level data but not the ID of the lower-level data, with the specific data of the digital object (DO) at the highest level stored in the data warehouse. For example, as shown in Table 5, the metadata of a sentence contains the ID of its paragraph. The metadata of a paragraph contains the ID of its article but does not contain the IDs of its sentences. The metadata of the article does not contain the IDs of its paragraphs. The data warehouse stores the specific data of the DO with ID300001.
TABLE 5: Metadata (table provided as images in the original: Figures RE-GDA0003803756620000211 and RE-GDA0003803756620000221)
The search keyword is "keyword two", the keyword "keyword two" is matched with the tags in table 5, and the successfully matched IDs are 100001, 100015, 200001, 200006 and 300001.
As described above, in Table 5, the first digit is 1 is the ID of a sentence, the first digit is 2 is the ID of a paragraph, and the first digit is 3 is the ID of an article. Therefore, it can be seen that ID100001 and ID100015 are the IDs of sentences, ID200001 and ID200006 are the IDs of paragraphs, and ID300001 is the ID of an article.
From any of these 5 IDs, the entire content of the article can be known.
The method for querying the whole content of the article according to ID100001 is as follows:
In Table 5, the ID of the upper-level data of ID100001 is found to be 200001. The ID of the upper-level data of ID200001 is then looked up in Table 5 and found to be 300001. The metadata of ID300001 does not include the ID of any upper-level data, so ID300001 is the highest-level data. The data warehouse is then queried for the specific data of ID300001, which restores the article content. In addition, information such as the article author can be obtained from the metadata of ID300001.
The method for querying the entire content of the article according to ID100015 is similar to the method for querying the entire content of the article according to ID100001, and is not described again.
The method for querying the whole content of the article according to ID200001 is as follows:
in table 5, the ID of the previous level data of ID200001 is searched, the searched ID is 300001, and the metadata of ID300001 does not include the ID of the previous level data, and therefore ID300001 is the data of the highest hierarchy level. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
The method for querying the whole content of the article according to the ID200006 is similar to the method for querying the whole content of the article according to the ID200001, and is not described again.
The method for querying the whole content of the article according to the ID300001 is as follows:
in table 5, the ID of the upper-level data of ID300001 is searched, and it is found that the metadata of ID300001 does not include the ID of the upper-level data, and therefore ID300001 is the data of the highest hierarchy level. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
Because the metadata contains the ID of the upper-level data, once any DO has been located through the search keyword, the ID of the upper-level DO contained in its metadata is queried level by level until the ID of the highest-level DO is reached. The data warehouse corresponding to the ID of the highest-level DO is then queried in the Handle System, and the specific data stored in that data warehouse is retrieved to obtain the specific data of the highest-level DO.
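The upward-only variant can be sketched as follows. The upper-level links for the five matched IDs follow the description; the warehouse number C0000100 for ID300001 and the article text are hypothetical placeholders, since the description does not give them, and the dictionary names are illustrative.

```python
# Upper-level ID stored in each object's metadata (Table 5 style, excerpt).
UPPER = {
    "100001": "200001",
    "100015": "200006",
    "200001": "300001",
    "200006": "300001",
    "300001": None,        # highest level: no upper-level ID
}

REGISTRATION = {"300001": "C0000100"}                    # hypothetical warehouse number
WAREHOUSES = {"C0000100": "full text of the article"}    # placeholder content


def highest_level_id(object_id):
    """Query upper-level IDs step by step until the highest-level DO is reached."""
    while UPPER[object_id] is not None:
        object_id = UPPER[object_id]
    return object_id


def fetch_article(object_id):
    """Resolve any matched ID to the specific data of the highest-level DO."""
    return WAREHOUSES[REGISTRATION[highest_level_id(object_id)]]


for matched_id in ["100001", "100015", "200001", "200006", "300001"]:
    print(matched_id, "->", highest_level_id(matched_id))
```

Compared with the two-way variant, no descent is needed here because the specific data is stored once, at the highest level.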
The inventors have found that the metadata may also contain the ID of the target level data, e.g. as shown in table 6, the target level being the level of the article. The specific data of the DO of the target hierarchy is stored in the data warehouse (the specific data of the DO of ID300001 is stored in the data warehouse). In this way, the DO at the target level can be located by any ID, and the specific data of the DO at the target level can be retrieved from the data warehouse.
TABLE 6: Metadata (table provided as images in the original: Figures RE-GDA0003803756620000231 and RE-GDA0003803756620000241)
The search keyword is "keyword two", the keyword "keyword two" is matched with the tags in the table 6, and the successfully matched IDs are 100001, 100015, 200001, 200006 and 300001.
As described above, in Table 6, the first digit is 1 is the ID of a sentence, the first digit is 2 is the ID of a paragraph, and the first digit is 3 is the ID of an article. Therefore, it can be seen that ID100001 and ID100015 are the IDs of sentences, ID200001 and ID200006 are the IDs of paragraphs, and ID300001 is the ID of an article.
From any of these 5 IDs, the entire content of the article can be known.
The method for querying the whole content of the article according to ID100001 is as follows:
in table 6, it can be found that ID of the target hierarchy data included in the metadata of ID100001 is 300001. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
The method for querying the entire content of the article according to ID100015 is as follows:
in table 6, it can be found that ID of the target hierarchy data included in the metadata of ID100015 is 300001. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
The method for querying the whole content of the article according to ID200001 is as follows:
in table 6, it can be found that ID of the target hierarchy data included in the metadata of ID200001 is 300001. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
The method for querying the whole content of the article according to the ID200006 is as follows:
in table 6, it can be found that the ID of the target hierarchy data included in the metadata of ID200006 is 300001. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
The method for querying the whole content of the article according to the ID300001 is as follows:
in table 6, it can be queried that the ID of the target hierarchy data included in the metadata of ID300001 is 300001. The data warehouse is queried for specific data for ID 300001. This enables the article content to be restored. Further, information such as an article author can be known from the metadata of the ID 300001.
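With a target-level ID in every metadata record, each of the five queries above reduces to a single lookup followed by a warehouse fetch. A minimal sketch, in which the warehouse number C0000100 and the article text are hypothetical placeholders and the dictionary names are illustrative:

```python
# Target-level ID stored in each object's metadata (Table 6 style, excerpt).
TARGET = {
    "100001": "300001",
    "100015": "300001",
    "200001": "300001",
    "200006": "300001",
    "300001": "300001",    # the article points to itself as the target level
}

REGISTRATION = {"300001": "C0000100"}                    # hypothetical warehouse number
WAREHOUSES = {"C0000100": "full text of the article"}    # placeholder content


def fetch_target(object_id):
    """Resolve a matched ID to the target-level DO in one step and fetch its data."""
    target_id = TARGET[object_id]
    return target_id, WAREHOUSES[REGISTRATION[target_id]]


for matched_id in TARGET:
    print(matched_id, fetch_target(matched_id))
```

The design trade-off is that no level-by-level traversal is required, at the cost of storing the target-level ID in the metadata of every DO.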
In the present application, a new level of information can be added very easily. For example, as shown in Table 4, the highest-level data is at the article level, and ID300001 corresponds to article 1. Suppose the web page also contains other articles, for example article 2. Article 1 is crawled first, graded according to its content, and stored level by level. Article 2 is then crawled, graded according to its content, and stored in the same way. Article 1 and article 2 originate from the same web page (web page 1, for ease of description). To add the information of web page 1 to the information of article 1, it is only necessary to add the ID of web page 1 to the upper-level ID column of ID300001 and then add the information of the web page. For example, if the ID of web page 1 is 400001, "400001" is added to the blank column in the last row of Table 4, and the related information of ID400001 is then added (assuming that web page 1 includes article 1 and article 2, and that the ID of article 2 is 300002). The present application can therefore accommodate an added level of information flexibly, with very little modification work. The underlying reason is that the metadata of each level of data includes the data's own ID together with the ID of the upper-level data and/or the IDs of the lower-level data. An added level affects only the data that is currently at the highest level and does not affect the data of other levels, so the metadata of the data of other levels need not be changed. The information of article 1 already stored in the registration system and the data warehouse is likewise unaffected by the addition of the web page information.
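The following sketch illustrates why adding a level touches only the current highest-level records. The registry shown is an abbreviated, hypothetical excerpt (one sentence and one paragraph of article 1, and an empty record for article 2); the `upper`/`lower` field names and the function `add_upper_level` are assumptions for illustration.

```python
import copy

# Abbreviated excerpt of the metadata before the web page level is added.
registry = {
    "100001": {"upper": "200001", "lower": []},
    "200001": {"upper": "300001", "lower": ["100001"]},
    "300001": {"upper": None, "lower": ["200001"]},   # article 1
    "300002": {"upper": None, "lower": []},           # article 2 (details omitted)
}


def add_upper_level(registry, new_id, child_ids):
    """Register a new highest-level object and link its children to it."""
    registry[new_id] = {"upper": None, "lower": list(child_ids)}
    for child_id in child_ids:
        registry[child_id]["upper"] = new_id


before = copy.deepcopy(registry)
add_upper_level(registry, "400001", ["300001", "300002"])

# Only the records that used to be at the highest level have changed.
changed = sorted(oid for oid in before if before[oid] != registry[oid])
print(changed)
```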
In the conventional Internet, information is organized as data and is located by address (such as a URL). In the DO system, by contrast, information is stored as DOs; each DO comprises metadata and specific data, and the corresponding DO can be found through its content rather than its address.
In the present application, the data is graded, so searching with finer granularity can be achieved.
For example, suppose that an article has 5 natural paragraphs containing 18 sentences in total, with 1, 3, 4, 6, and 4 sentences respectively. Each sentence has a keyword, each paragraph has a keyword, and the entire article has a keyword. The sentence keywords, with one line per paragraph, are as follows:
AAA
BBB HHH HHH
CCC DDD FFF DDD
BBB DDD AAA EEE BBB BBB
AAA BBB GGG BBB
The keyword of sentence 1 is AAA. The keyword of sentence 2 is BBB. The keyword of sentence 3 is HHH. The keyword of sentence 4 is HHH. The keyword of sentence 5 is CCC. The keyword of sentence 6 is DDD. The keyword of sentence 7 is FFF. The keyword of sentence 8 is DDD. The keyword of sentence 9 is BBB. The keyword of sentence 10 is DDD. The keyword of sentence 11 is AAA. The keyword of sentence 12 is EEE. The keyword of sentence 13 is BBB. The keyword of sentence 14 is BBB. The keyword of sentence 15 is AAA. The keyword of sentence 16 is BBB. The keyword of sentence 17 is GGG. The keyword of sentence 18 is BBB.
The keyword in paragraph 1 is AAA. The keyword in paragraph 2 is HHH. The keyword of paragraph 3 is DDD. The keyword in paragraph 4 is BBB. The keyword of paragraph 5 is BBB.
The full-text keyword is BBB.
If the article is not graded, it has only one keyword: BBB. The article can be retrieved when the search keyword is "BBB", but not when the search keyword is anything else; for example, it cannot be retrieved when the search keyword is "AAA".
If the article is graded by paragraph, it has a full-text-level keyword, BBB, and also paragraph-level keywords: AAA, HHH, DDD, BBB. Here "BBB" is both the full-text-level keyword and the keyword of paragraphs 4 and 5. Searching for "AAA" retrieves paragraph 1; searching for "HHH" retrieves paragraph 2; searching for "DDD" retrieves paragraph 3; searching for "BBB" retrieves paragraphs 4 and 5 and the article. As described above, after a paragraph is retrieved, the article containing it can be located, and thus retrieved, by using the ID of the upper-level data in the metadata of the paragraph-level DO.
Therefore, the articles are graded according to paragraphs, and keyword retrieval at the paragraph level is adopted during retrieval, so that the probability of the articles being retrieved can be effectively increased.
If the article is graded by paragraph and the paragraphs are further graded by sentence, the article has a full-text-level keyword, BBB; paragraph-level keywords AAA, HHH, DDD, BBB; and sentence-level keywords AAA, BBB, HHH, CCC, DDD, FFF, EEE, GGG.
Here, "BBB" is the full-text-level keyword, the keyword of paragraphs 4 and 5, and the keyword of sentences 2, 9, 13, 14, 16 and 18.
When the keyword "AAA" is searched, sentences 1, 11, and 15 can be retrieved. When the keyword "HHH" is searched, sentences 3 and 4 can be retrieved. When searching for the keyword "CCC," sentence 5 can be retrieved. When the keyword "DDD" is searched, sentences 6, 8, and 10 can be retrieved. When searching for the keyword "FFF", sentence 7 can be retrieved. When the keyword "GGG" is searched, sentence 17 can be retrieved.
It can be seen that the benefit of extracting keywords from each level of data is: the probability of the articles being retrieved can be greatly improved during searching.
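The effect of grading on retrieval in the example above can be reproduced with a small sketch. The keywords are those given in the example; the index layout and the function names `build_index` and `search` are illustrative assumptions, not part of the application.

```python
from collections import defaultdict

# Sentence-level keywords from the example, one inner list per paragraph.
SENTENCE_KEYWORDS = [
    ["AAA"],
    ["BBB", "HHH", "HHH"],
    ["CCC", "DDD", "FFF", "DDD"],
    ["BBB", "DDD", "AAA", "EEE", "BBB", "BBB"],
    ["AAA", "BBB", "GGG", "BBB"],
]
PARAGRAPH_KEYWORDS = ["AAA", "HHH", "DDD", "BBB", "BBB"]
ARTICLE_KEYWORD = "BBB"


def build_index():
    """Map each keyword to the (level, number) units tagged with it."""
    index = defaultdict(list)
    index[ARTICLE_KEYWORD].append(("article", 1))
    sentence_no = 0
    for p_no, (p_kw, s_kws) in enumerate(
        zip(PARAGRAPH_KEYWORDS, SENTENCE_KEYWORDS), start=1
    ):
        index[p_kw].append(("paragraph", p_no))
        for s_kw in s_kws:
            sentence_no += 1
            index[s_kw].append(("sentence", sentence_no))
    return index


def search(index, keyword, levels):
    """Return the units at the given levels whose keyword matches."""
    return [unit for unit in index.get(keyword, []) if unit[0] in levels]


index = build_index()
print(search(index, "AAA", {"article"}))               # ungraded article only
print(search(index, "AAA", {"article", "paragraph"}))  # graded by paragraph
print(search(index, "AAA", {"article", "paragraph", "sentence"}))
```

With only the article level indexed, "AAA" finds nothing; each additional level adds more units through which the article can be reached.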
As an alternative embodiment, after the data searching step, correlation analysis may be performed on the search results, that is, DOs are associated with one another according to the needs of the user.
Each user views information from a different angle. To extract information with a different emphasis, the user proposes a topic for the question of interest and lists a group of customized keywords (that is, search keywords) for that topic. The system searches the digital object architecture with these keywords, connects each digital object related to a keyword to the corresponding keyword DO, and finally connects all the keyword DOs to form the information network built around the topic.
In the application, the search keywords can be customized according to the actual requirements of the user.
An organization name can be used as a search keyword to find articles from different organizations, which can then be compared and analyzed to judge their stance on certain topics.
An author's name can be used as a search keyword to find the articles published by that author and observe how the author's field of interest changes over time.
Finally, a knowledge graph of the field can be built and the correlations between entities mined.
Example 1
With the exponential growth of internet information, data increasingly presents the characteristics of fragmentation and decentralization, valuable information is often buried in a large amount of disordered information, and the value of the information can be reflected after effective association. Therefore, the application hopes to utilize a novel internet architecture to provide an effective information organization way to identify and connect potentially valuable information.
At present, the organization of information relies mainly on search engines or on crawlers combined with manual judgment, but these approaches are built on the traditional Internet architecture rather than on a DOA-based information management method. The DOA was proposed by the Turing Award winners Robert Kahn and Vinton Cerf, who are also the inventors of TCP/IP. TCP/IP is an Internet architecture centered on end-to-end transmission, whereas DOA is an Internet architecture centered on data and focuses more on the organization and processing of data. The DOA-based approach is therefore better suited to organizing, connecting and analyzing massive fragmented data.
The present application divides the organization of information into four steps.
The first step is to collect data from the network. The targets are mainly diverse, multimodal and heterogeneous data, such as online articles, videos and their subtitles, pictures and charts, posts on social networks, files containing pictures or text, and the external links contained in this information. During collection, the organization or author to which the information belongs, the generation time and the title need to be recorded, and, where possible, the target location related to the information should also be collected. Collection may be automated or may involve human participation, for example by distributed web crawlers well known in the art or by relevant experts.
The second step needs to carry out structured sorting and analysis on the collected data and extract fragmented information. Different information types have different arrangements. The arrangement mode can be realized by means of natural language processing technology (NLP), manual accurate labeling or a processing means combining manual operation and automation.
The simplest way is to organize the information by granularity. For the textual information of a website, a tree structure can be arranged from top to bottom: the top level is the website information, the second level is the web page information, a web page contains different articles, and the sentences in the articles serve as the smallest units of the digital objects.
Alternatively, news can be organized by its arguments: the overall argument of a news item is labeled first, and then each sub-argument, together with the discussion supporting it, is grouped into one information unit. Organizing by the elements of news, including the time, place, people and agenda of the news, is also feasible. It is worth mentioning that if a news item focuses particularly on certain elements, it may be further divided into different information units; for example, when several important persons appear, each person may be treated as an information unit.
In addition to news, the information may be composed in the form of files, such as a report containing a plurality of files, in which case the report may be regarded as superior information, each file being a sub-information it contains. In addition to a file, a web page may contain multiple links, and each link may be considered a unit of sub-information.
The third step entails reorganizing the fragmented information with a digital object architecture.
First, the information structure produced in the second step is organized as hierarchical DOs. The smallest DO unit is a sentence, a file, or a link. The DO one level up is the paragraph, argument or news element that contains the sentence. The DO above the paragraphs is an article or a report. Multiple articles sit at the same level as lower-level DOs of a web page. The top level is the entire website, which contains multiple web pages.
Then, the DO information is divided into three parts and placed into the three components of the DOA. First, the description information (metadata) of the DO is put into the Registry. Second, the unique identifier of each DO is registered in the Handle System, and a Repository is allocated to store the DO. Finally, the data of the DO is placed in the Repository.
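The three-part placement can be sketched as follows. The dictionaries below are hypothetical in-memory stand-ins for the Registry, the Handle System and the Repositories; the repository names, the IDs and the allocation rule (pick the least-loaded repository) are assumptions made for illustration and are not the interface of any real DOA implementation.

```python
from typing import Dict

# Hypothetical in-memory stand-ins for the three DOA components.
registry: Dict[str, dict] = {}      # DO ID -> metadata
handle_system: Dict[str, str] = {}  # DO ID -> name of the repository holding the data
repositories: Dict[str, Dict[str, str]] = {"repo-1": {}, "repo-2": {}}


def pick_repository() -> str:
    """Toy allocation rule (an assumption): the repository with the fewest objects."""
    return min(repositories, key=lambda name: (len(repositories[name]), name))


def store_digital_object(do_id: str, metadata: dict, data: str) -> str:
    """Place one DO into the three components and return the repository used."""
    # 1. The descriptive metadata goes into the Registry.
    registry[do_id] = {**metadata, "id": do_id}
    # 2. The unique identifier is registered together with the allocated repository.
    repo = pick_repository()
    handle_system[do_id] = repo
    # 3. The data itself goes into that repository.
    repositories[repo][do_id] = data
    return repo


store_digital_object("100001", {"tags": ["AAA"], "parent_id": "200001"}, "First sentence.")
store_digital_object("100002", {"tags": ["BBB"], "parent_id": "200002"}, "Second sentence.")
```

Keeping the identifier-to-repository mapping separate from the metadata means a DO can later be moved to another repository by updating a single entry.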
The fourth step is to associate the DOs according to the user's needs. Each user views information from a different angle. To extract information with a different emphasis, the user proposes a topic for the question of interest and lists a group of customized keywords for that topic. The system searches the digital object architecture with these keywords, connects each digital object related to a keyword to the corresponding keyword DO, and finally connects all the keyword DOs to form the information network built around the topic.
Example 2
The information analysis is carried out by utilizing a DOA system, and the method mainly comprises the following four steps: step one, information acquisition; step two, structured analysis; thirdly, reorganizing based on DOA; and step four, correlation analysis.
Step one, information acquisition
In this embodiment, information is acquired mainly by automated crawlers. The information sources of interest, such as the websites of major news media, are determined first.
After the information sources are determined, the crawling scope needs to be determined, for example whether to follow external links in the target website, the crawl depth, and whether to crawl files or pictures. Usually, as much relevant information as possible is collected, since the collected data is likely to be the information source for the digital objects (DOs) generated in the following steps.
Step two, structured analysis
Structured analysis is a very important step in information management, and is helpful for fine-grained analysis and management of data. Here, related techniques of natural language processing, such as lexical analysis, syntactic analysis, dependency analysis, and the like, are required.
In this example, structured analysis is carried out in the simplest manner, that is, by paragraph and by sentence. Paragraph parsing is simple: the text of the article only needs to be split at line breaks (carriage returns). Sentences, however, cannot be split simply at periods, because some abbreviations such as "U.S.A." also contain periods, so sentence segmentation is assisted by an abbreviation corpus.
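A minimal sketch of this segmentation is given below. The abbreviation set is a tiny illustrative stand-in for an abbreviation corpus, and the token-based rule is an assumption chosen for simplicity; the application does not specify the algorithm.

```python
import re
from typing import List, Set

# Tiny illustrative abbreviation list; a real system would load a larger corpus.
ABBREVIATIONS: Set[str] = {"u.s.a.", "u.s.", "dr.", "mr.", "e.g.", "i.e."}


def split_paragraphs(text: str) -> List[str]:
    """Split on line breaks and drop empty lines."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """End a sentence after '.', '!' or '?' unless the token is a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in paragraph.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


text = "He moved to the U.S.A. in 2010. He works there.\nDr. Li agrees."
for paragraph in split_paragraphs(text):
    print(split_sentences(paragraph))
```

One known limitation of this simple rule is that a sentence ending in an abbreviation is not split from the sentence that follows it; handling that case needs additional cues such as capitalization of the next word.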
In addition to segmenting the paragraphs and sentences of an article, it is desirable to extract attributes from the article, such as its author and the organization that published it. If a knowledge base for the professional field exists, entities in the sentences, such as places and organizations, can also be labeled. This facilitates the subsequent organization of information.
Step three, reorganizing based on DOA
In this step, Cordra is used to convert the structured information into DOs, again taking articles as the example information source. Each article is first treated as a DO, and the author and organization of the article are treated as attributes of that DO. Finally, each sentence in the article is generated as a next-level DO, and the DOID of the paragraph containing the sentence is recorded in the sentence DO as an attribute.
After the DOs are generated, following the DOA flow, the metadata is put into the Registry, the DOID is registered in the Handle System, and finally the data is stored in the Repository.
Step four, correlation analysis
Finally, correlation analysis is performed using the generated DOs, and the information acquired from different sources is fused, analyzed and mined. For example, articles from different organizations can be compared and analyzed to judge their stance on certain topics; different authors can be tracked to observe how their fields of interest change; and finally a knowledge graph of the field can be built, the correlations between entities mined, and correlation analysis carried out. For example, if lung infection and electronic cigarettes are found to co-occur many times, and pneumonia and COVID-19 also frequently occur together, then electronic cigarettes and COVID-19 can be connected in the knowledge graph for further analysis by experts. FIG. 2 is a diagram illustrating a correlation analysis result according to an example embodiment.
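The co-occurrence counting behind such a knowledge graph can be sketched as follows. The entity sets are made-up toy input, and the threshold-based edge selection is an assumption; the application does not specify how co-occurrence is measured.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def cooccurrence_counts(sentence_entities: Iterable[Set[str]]) -> Counter:
    """Count how often each unordered pair of entities appears in the same sentence."""
    counts: Counter = Counter()
    for entities in sentence_entities:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


def frequent_edges(counts: Counter, min_count: int) -> List[Tuple[str, str]]:
    """Keep the pairs that co-occur at least min_count times as candidate graph edges."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)


# Toy input: the labeled entities of five sentence-level DOs (illustrative only).
sentences = [
    {"lung infection", "e-cigarette"},
    {"lung infection", "e-cigarette"},
    {"pneumonia", "COVID-19"},
    {"pneumonia", "COVID-19"},
    {"e-cigarette", "tax"},
]
counts = cooccurrence_counts(sentences)
print(frequent_edges(counts, min_count=2))
```

The resulting edges are only candidates; as the text notes, links inferred across them are handed to experts for further analysis.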
The embodiment of the application also provides a data grading processing device which can execute the data grading processing method.
The apparatus is based on a digital object hierarchy. The digital object system includes: registry, registration system, data warehouse.
As shown in fig. 3, the apparatus includes: a data acquisition unit 10, a data classification unit 20, a digital object hierarchy-based classification registration and storage unit 30, and a data search unit 40.
The data acquisition unit 10 is configured to acquire data to be processed.
The data classification unit 20 is configured to grade the data to be processed to obtain graded data.
The hierarchical registration and storage unit 30 based on the digital object hierarchy is used to: determining digital objects of all levels according to the classified data, wherein the digital objects comprise metadata and specific data, and the metadata comprises the classification information of the digital objects, so that the digital objects are convenient to retrieve in the future; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and assigning a data repository; and storing specific data of the digital object into a data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object.
The data search unit 40 is configured to: determining a search keyword; matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the tags which are successfully matched, and acquiring unique identifiers of the digital objects; inquiring a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; specific data of the digital object is obtained from the queried data warehouse.
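The search flow of the data search unit can be sketched as follows. The three dictionaries are hypothetical in-memory stand-ins for the registry, the registration system and the data warehouses, and all IDs, tags and warehouse names are made up for illustration.

```python
from typing import Dict, List, Tuple

# Registry: DO ID -> metadata (only the tags are needed here).
registry: Dict[str, dict] = {
    "100001": {"tags": ["AAA"]},
    "100002": {"tags": ["BBB"]},
    "200002": {"tags": ["HHH"]},
}
# Registration system: DO ID -> data warehouse allocated to that DO.
registration_system: Dict[str, str] = {
    "100001": "warehouse-a",
    "100002": "warehouse-b",
    "200002": "warehouse-a",
}
# Data warehouses: warehouse name -> {DO ID -> specific data}.
warehouses: Dict[str, Dict[str, str]] = {
    "warehouse-a": {"100001": "text of sentence 1", "200002": "text of paragraph 2"},
    "warehouse-b": {"100002": "text of sentence 2"},
}


def search(keyword: str) -> List[Tuple[str, str]]:
    """Match tags, resolve each hit to its warehouse, then fetch the data."""
    results = []
    for do_id, metadata in registry.items():
        if keyword in metadata["tags"]:              # tag matching in the registry
            warehouse = registration_system[do_id]   # look up the allocated warehouse
            results.append((do_id, warehouses[warehouse][do_id]))
    return results


print(search("BBB"))
```

The caller never needs to know where the data lives: the tag match yields an identifier, and the identifier alone is enough to locate and fetch the content.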
Optionally, the metadata further comprises hierarchical data comprising unique identifications of digital objects at a previous and/or next level.
Optionally, the data search unit is further configured to: determine the hierarchical data in the metadata of the digital object corresponding to the successfully matched tag; determine the digital object of the previous level step by step according to the hierarchical data until the digital object of the highest level is determined; determine all digital objects of the next level step by step, starting from the digital object of the highest level, until all digital objects of the lowest level are determined; query, from the registration system, the data warehouse storing all digital objects of the lowest level; and query the specific data of all digital objects of the lowest level from the data warehouse.
Optionally, the data search unit is configured to: determine the hierarchical data in the metadata of the digital object corresponding to the successfully matched tag; determine the digital object of the highest level according to the hierarchical data; query, from the registration system, the data warehouse storing the digital object of the highest level; and query the specific data of the digital object of the highest level from the data warehouse.
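The first optional traversal (up to the highest level, then down to all lowest-level objects) can be sketched as follows. The `HIERARCHY` table, its field names and the IDs are hypothetical, and the sketch assumes that every object without lower-level objects is at the lowest level.

```python
from typing import Dict, List

# Hypothetical hierarchy data: DO ID -> upper-level ID and ordered lower-level IDs.
HIERARCHY: Dict[str, dict] = {
    "300001": {"parent": None, "children": ["200001", "200002"]},
    "200001": {"parent": "300001", "children": ["100001"]},
    "200002": {"parent": "300001", "children": ["100002", "100003"]},
    "100001": {"parent": "200001", "children": []},
    "100002": {"parent": "200002", "children": []},
    "100003": {"parent": "200002", "children": []},
}


def find_top(do_id: str) -> str:
    """Follow upper-level IDs step by step until the highest-level object is reached."""
    while HIERARCHY[do_id]["parent"] is not None:
        do_id = HIERARCHY[do_id]["parent"]
    return do_id


def collect_lowest(do_id: str) -> List[str]:
    """Follow lower-level IDs step by step and return the lowest-level objects in order."""
    children = HIERARCHY[do_id]["children"]
    if not children:
        return [do_id]
    lowest: List[str] = []
    for child in children:
        lowest.extend(collect_lowest(child))
    return lowest


# A search hit on sentence 100003 leads to the article and then to all of its sentences.
top = find_top("100003")
print(top, collect_lowest(top))
```

Because the lower-level IDs are kept in order, fetching the specific data of the returned objects in sequence is enough to restore the original content.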
Optionally, the unique identifiers of the data objects of different hierarchies are distinguished in a preset manner.
Optionally, the apparatus further comprises: a query unit to: and restoring the content of the data to be processed according to the inquired specific data of the digital objects of the multiple hierarchies and the hierarchy data in the metadata.
Optionally, the apparatus further comprises an adding unit, the adding unit being configured to: determining the digital object of the previous level step by step according to the level data in the metadata until determining the digital object of the highest level; and adding the newly added hierarchy data to the metadata of the digital object at the highest hierarchy level.
FIG. 4 is a block diagram of an electronic device shown in accordance with an example embodiment.
An electronic device 400 according to this embodiment of the present application is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one memory unit 420, a bus 430 that connects the various system components (including the memory unit 420 and the processing unit 410), a display unit 440, and the like.
Wherein the storage unit stores program code that can be executed by the processing unit 410, such that the processing unit 410 performs the steps according to various exemplary embodiments of the present application described in the present specification. For example, the processing unit 410 may perform the steps as shown in fig. 1.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203.
The memory unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices 400' (e.g., keyboard, pointing device, bluetooth device, etc.) such that a user can communicate with devices with which the electronic device 400 interacts, and/or any devices (e.g., router, modem, etc.) with which the electronic device 400 can communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 460. The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, as shown in fig. 5, the technical solution according to the embodiment of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present application.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: acquiring data to be processed; grading the data to be processed to obtain graded data; determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identification of the digital object in the registration system and assigning a data repository; and storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a label of the digital object, and the label of the digital object is used for matching with a search keyword when a user searches the digital object.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present application.
Exemplary embodiments of the present application are specifically illustrated and described above. It is to be understood that the application is not limited to the details of construction, arrangement, or method of implementation described herein; on the contrary, the intention is to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for hierarchical processing of data, the method comprising:
acquiring data to be processed;
grading the data to be processed to obtain graded data;
determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data;
for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identifier of the digital object in a registration system and allocating a data warehouse; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a tag of the digital object, and the tag of the digital object is used for matching with a search keyword when a user searches the digital object;
determining a search keyword;
matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the tags which are successfully matched, and acquiring unique identifiers of the digital objects;
querying a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier;
specific data of the digital object is obtained from the queried data warehouse.
2. The data hierarchy processing method of claim 1, wherein the metadata further includes hierarchy data, the hierarchy data includes unique identifications of digital objects at a previous hierarchy and/or a next hierarchy, and the unique identifications of data objects at different hierarchies are distinguished in a preset manner.
3. The data staging method of claim 2, further comprising:
determining the hierarchical data in the metadata of the digital object corresponding to the successfully matched label;
determining the digital object of the previous level step by step according to the level data until determining the digital object of the highest level;
determining all digital objects of the next level step by step according to the digital object of the highest level until all digital objects of the lowest level are determined;
querying a data repository storing all digital objects at the lowest hierarchical level from the registry;
querying specific data of all digital objects at the lowest hierarchical level from the data warehouse.
4. The data staging method of claim 2, further comprising:
determining the hierarchical data in the metadata of the digital object corresponding to the successfully matched tag;
determining a digital object of a highest hierarchical level from the hierarchical data;
querying a data repository storing the digital object at the highest hierarchical level from the registry;
specific data of the digital object of the highest hierarchical level is queried from the data warehouse.
5. The data staging method of any one of claims 2-4, wherein the method further comprises:
determining the digital object of the previous level step by step according to the level data in the metadata until determining the digital object of the highest level;
and adding the newly added hierarchy data to the metadata of the digital object at the highest hierarchy level.
6. A data staging apparatus, the apparatus comprising: a data acquisition unit, a data classification unit, a classification registration and storage unit based on a digital object system, a search unit,
the data acquisition unit is configured to: acquiring data to be processed;
the data classification unit is configured to: grading the data to be processed to obtain graded data;
the hierarchical registration and storage unit based on the digital object hierarchy is used for: determining digital objects of various hierarchies according to the hierarchical data, wherein the digital objects comprise metadata and specific data; for each level of the digital object, performing the steps of: storing metadata of the digital object in a registry; registering the unique identification of the digital object in the registration system and assigning a data repository; storing specific data of the digital object into the data warehouse, wherein the specific data is information reflecting the content of the data to be processed, the metadata is used for describing the digital object, the metadata comprises a unique identifier of the digital object and a label of the digital object, the label of the digital object is used for matching with a search keyword when a user searches the digital object,
the search unit is configured to: determining a search keyword; matching the search keywords with the tags in the metadata, screening out the digital objects corresponding to the tags which are successfully matched, and acquiring unique identifiers of the digital objects; querying a data warehouse for storing specific data of the digital object from the registration system according to the unique identifier; data specific to the digital object is obtained from the queried data repository.
7. The apparatus according to claim 6, wherein the metadata further comprises hierarchical data comprising unique identifiers of digital objects at a previous and/or next level, the unique identifiers of data objects at different levels being distinguished using a predetermined manner.
8. The data hierarchy processing apparatus of claim 7, the apparatus further comprising an adding unit operable to: determining the digital object of the previous level step by step according to the level data in the metadata until determining the digital object of the highest level; and adding the newly added hierarchy data to the metadata of the digital object at the highest hierarchy level.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data grading processing method according to any one of claims 1 to 5.
10. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the data grading processing method according to any one of claims 1 to 5.
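
For illustration only, the registration, search, and adding steps recited in claims 6 to 8 can be sketched in Python as follows. This is a minimal in-memory sketch, not the claimed implementation. The names `Metadata`, `RegistrationSystem`, `register`, `search`, and `add_to_top_level` are hypothetical, as are the `L<level>-<serial>` identifier format (one possible "predetermined manner" of distinguishing levels) and the use of exact tag matching.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Metadata:
    """Describes one digital object (all field names are hypothetical)."""
    uid: str                                  # unique identifier
    tags: List[str]                           # matched against search keywords
    parent_uid: Optional[str] = None          # level data: previous level
    child_uids: List[str] = field(default_factory=list)   # level data: next level
    added_hierarchy_data: List[str] = field(default_factory=list)


class RegistrationSystem:
    """In-memory stand-in for the registration system and data warehouses."""

    def __init__(self) -> None:
        self._metadata: Dict[str, Metadata] = {}
        self._warehouse_of: Dict[str, str] = {}            # uid -> warehouse name
        self._warehouses: Dict[str, Dict[str, str]] = {}   # name -> {uid: data}
        self._serial = count(1)

    def register(self, level: int, tags: List[str], specific_data: str,
                 warehouse: str, parent_uid: Optional[str] = None) -> str:
        # Assumed identifier scheme: the level is encoded as a prefix.
        uid = f"L{level}-{next(self._serial):04d}"
        # Store the metadata, then link the new object to its previous level.
        self._metadata[uid] = Metadata(uid, list(tags), parent_uid)
        if parent_uid is not None:
            self._metadata[parent_uid].child_uids.append(uid)
        # Register the identifier and assign a data warehouse.
        self._warehouse_of[uid] = warehouse
        # Store the specific data in the assigned warehouse.
        self._warehouses.setdefault(warehouse, {})[uid] = specific_data
        return uid

    def search(self, keyword: str) -> Dict[str, str]:
        # Match the keyword against the tags (exact match is an assumption).
        hits = [m.uid for m in self._metadata.values() if keyword in m.tags]
        # Look up each object's warehouse by identifier, then fetch its data.
        return {uid: self._warehouses[self._warehouse_of[uid]][uid]
                for uid in hits}

    def add_to_top_level(self, uid: str, new_hierarchy_data: str) -> str:
        # Walk up level by level until the highest-level object is reached.
        current = self._metadata[uid]
        while current.parent_uid is not None:
            current = self._metadata[current.parent_uid]
        current.added_hierarchy_data.append(new_hierarchy_data)
        return current.uid

    def metadata(self, uid: str) -> Metadata:
        return self._metadata[uid]


system = RegistrationSystem()
top = system.register(1, ["traffic"], "city summary", "warehouse-a")
leaf = system.register(2, ["traffic", "road"], "road readings", "warehouse-b",
                       parent_uid=top)
found = system.search("road")                       # only the level-2 object
highest = system.add_to_top_level(leaf, "level-3 objects added")
```

In this sketch the search never touches a warehouse until the tag match has produced identifiers, which mirrors the order of steps in the search unit: keyword, tag match, identifier, warehouse lookup, specific data.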
CN202210845080.5A 2022-07-19 2022-07-19 Data grading processing method and device, electronic equipment and computer readable medium Pending CN115168401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210845080.5A CN115168401A (en) 2022-07-19 2022-07-19 Data grading processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN115168401A true CN115168401A (en) 2022-10-11

Family

ID=83495188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210845080.5A Pending CN115168401A (en) 2022-07-19 2022-07-19 Data grading processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN115168401A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290401A (en) * 2023-11-23 2023-12-26 北京富算科技有限公司 Data transaction method and system
CN117290401B (en) * 2023-11-23 2024-03-15 北京富算科技有限公司 Data transaction method and system
CN117971951A (en) * 2024-04-02 2024-05-03 北京大数据先进技术研究院 Heterogeneous registry-oriented digital object metadata interoperation method, device, equipment and medium
CN117971951B (en) * 2024-04-02 2024-07-02 北京大数据先进技术研究院 Heterogeneous registry-oriented digital object metadata interoperation method, device, equipment and medium

Similar Documents

Publication Publication Date Title
KR100882582B1 (en) System and method for research information service based on semantic web
Bizer et al. Dbpedia-a crystallization point for the web of data
US8166013B2 (en) Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
US8554800B2 (en) System, methods and applications for structured document indexing
US6691105B1 (en) System and method for geographically organizing and classifying businesses on the world-wide web
Andrews et al. A classification of semantic annotation systems
US20090077094A1 (en) Method and system for ontology modeling based on the exchange of annotations
Gentile et al. Unsupervised wrapper induction using linked data
CN115168401A (en) Data grading processing method and device, electronic equipment and computer readable medium
CN101655862A (en) Method and device for searching information object
US8359307B2 (en) Method and apparatus for building sales tools by mining data from websites
US20130283231A1 (en) Method and System for Compiling a Unique Sample Code for an Existing Digital Sample
Nesi et al. Geographical localization of web domains and organization addresses recognition by employing natural language processing, Pattern Matching and clustering
Yi et al. Revisiting the syntactical and structural analysis of Library of Congress Subject Headings for the digital environment
CN102968469A (en) Method and system for building application index, and method and system for application indexes
WO2009054611A1 (en) System and method for managing information map
Ye et al. Learning object models from semistructured web documents
Cortez et al. A flexible approach for extracting metadata from bibliographic citations
Spahiu et al. Topic profiling benchmarks in the linked open data cloud: Issues and lessons learned
CN109948015B (en) Meta search list result extraction method and system
LIM et al. Web mining-The ontology approach
Moura et al. Integration of linked data sources for gazetteer expansion
Manguinhas et al. A geo-temporal web gazetteer integrating data from multiple sources
Jung Towards open decision support systems based on semantic focused crawling
Sabou et al. Towards improving web service repositories through semantic web techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination