CN110765106A - Data information processing method and system based on visual features - Google Patents

Publication number
CN110765106A
Authority
CN
China
Prior art keywords
data
text
image
information processing
url
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911009498.7A
Other languages
Chinese (zh)
Inventor
丁芳桂
郑创伟
邵晓东
赵捍东
谢志成
何亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Creative Smart Port Technology Co Ltd
Shenzhen Newspaper Group E Commerce Co Ltd
SHENZHEN PRESS GROUP
Original Assignee
Shenzhen Creative Smart Port Technology Co Ltd
Shenzhen Newspaper Group E Commerce Co Ltd
SHENZHEN PRESS GROUP

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data information processing method and system based on visual characteristics. The method comprises: collecting image-text data based on visual characteristics through a network; preprocessing the image-text data to obtain target data meeting a first requirement; performing comprehensive big data processing on the target data to obtain valid data meeting a second requirement; and establishing, updating and/or resetting an image-text database according to the valid data. In this way, a multi-level data cleaning process can be realized, the data can be made accurate, complete, consistent, unique, timely and valid, and problems such as missing, inconsistent and duplicated data can be effectively resolved. The image-text data are ultimately integrated into a comprehensive image-text database, which facilitates the integration, transformation and upgrading of the industry.

Description

Data information processing method and system based on visual features
Technical Field
The application relates to the technical field of data processing, in particular to a data information processing method based on visual characteristics and a system applying the data information processing method based on the visual characteristics.
Background
The news publishing industry regards the "Dibao" (邸报, often rendered as "mansion newspaper") as the earliest newspaper in China. The "di" (邸, "mansion") originally referred to the residence in the capital used by officials who came to attend court; it is said to have appeared as early as the Warring States period, although some say it began in the Western Han. The commentator Yan Shigu glossed the term as the lodging that commanderies and kingdoms kept in the capital for court visits, and explained "di" as the place to which one returns. The "di" later served as the capital office of resident local officials, transmitting communications and messages, and its bulletin came to be called the "Dibao". The Dibao, also called the "Dichao" (mansion copy) and also known as the "ward newspaper", "bar newspaper" and "miscellaneous newspaper", was a bulletin used specially by the court to circulate administrative documents and political information, and was a form of hand-copied news.
The immediate origin of the modern newspaper is the printed news sheet (a newsletter carrying a single item) that began to appear in Germany in the 15th century. The Frankfurter news, a periodical founded in 1615, is generally considered the first "real" newspaper, because it had a fixed name, was published regularly every week, and carried several news items in each issue rather than a single one (although it was printed on one side only). The English word "newspaper" first appeared in 1665, in the first British newspaper, the Oxford Gazette. The earliest daily newspaper, the Einkommende Zeitungen, appeared in Leipzig, Germany, in 1650, but dailies became the mainstay of the press only after the 18th century. The spread of daily newspapers marks the maturity of the news industry in a country or region, because continuous daily publication places high demands on the collection and transmission of information, on printing technology, on the quality of news staff and on the level of management.
On the other hand, with the rapid development of information technology, newspapers have gradually expanded from paper form to electronic form, which greatly facilitates users, but poses a great challenge to the conventional media industry.
Meanwhile, to seize the opportunity presented by major national and local industrial policies that strongly promote the cultural industry, and to capture the scientific and technological high ground of the industry, more and more media organizations need to put technology into industrial application in order to upgrade and become more competitive, thereby promoting the transformation and upgrading of the media industry, content aggregation across the cultural industry, and the mining of content value.
However, the prior art lacks a systematic and comprehensive approach to data processing in the media industry: data accuracy is low, authenticity is difficult to verify, no unified database manages all current media data, and the transformation and upgrading of the industry are therefore difficult to achieve.
In view of various defects in the prior art, the inventors of the present application have made extensive studies to provide a data information processing method and system based on visual characteristics.
Disclosure of Invention
The data information processing method and system based on visual characteristics provided by the present application can realize multi-level data cleaning, make the data accurate, complete, consistent, unique, timely and valid, and effectively resolve problems such as missing, inconsistent and duplicated data. The image-text data are ultimately integrated into a comprehensive image-text database, which facilitates the integration, transformation and upgrading of the industry.
In order to solve the above technical problem, the present application provides a data information processing method based on visual characteristics, as one embodiment, the data information processing method based on visual characteristics includes:
acquiring image-text data based on visual characteristics through a network;
carrying out data preprocessing on the image-text data to obtain target data meeting a first requirement;
carrying out big data comprehensive processing on the target data to obtain valid data meeting a second requirement;
and establishing, updating and/or resetting the image-text database according to the valid data.
As one of the embodiments:
the step of performing data preprocessing on the image-text data specifically comprises:
performing data cleaning, extraction, processing and/or classification on the image-text data;
the step of performing big data comprehensive processing on the target data specifically comprises:
performing sensitive content filtering, duplicate content filtering and/or keyword extraction on the target data.
As an implementation manner, the data information processing method based on the visual characteristics stores the image-text data, the target data and/or the valid data through an HDFS distributed file system based on a Hadoop distributed system architecture.
As an implementation manner, the step of storing the image-text data, the target data and/or the valid data by using an HDFS distributed file system based on a Hadoop distributed system architecture specifically includes:
storing, in a to-be-crawled URL library, the set of URLs to be crawled, wherein the URL set is a text file recording the URLs to be crawled and serves as the crawler's entry point into the Internet;
storing, in a raw web page library, the HTML of each crawled raw web page, wherein the URL is stored as the key and the HTML of the web page corresponding to the URL is stored as the value;
storing, in an out-link URL library, the out-links obtained by parsing, wherein the URL is stored as the key and the set of out-links contained in the web page corresponding to the URL is stored as the value;
and storing, in an XML library, the XML obtained by converting the crawled web pages, which comprises the image-text data, the target data and the valid data, wherein the URL is stored as the key and the XML of the web page corresponding to the URL is stored as the value.
As an embodiment, the step of acquiring the image-text data based on visual characteristics through the network specifically includes:
analyzing the query interface of a data source by recursive element combination based on a maximum entropy statistical model, so as to acquire the image-text data from deep web data.
As an embodiment, the step of acquiring the image-text data based on visual characteristics through the network specifically includes:
collecting Web database samples by progressive sampling based on a weighted attribute value graph model, so as to acquire the image-text data from deep web data.
As an embodiment, the step of acquiring the image-text data based on visual characteristics through the network specifically includes:
performing data extraction and labeling with a record alignment model based on attribute value similarity and tags, so as to acquire the image-text data from deep web data, which comprises:
aligning the attribute values in any two records by dynamic programming, so that the sum of the similarities of the aligned attribute values is maximal;
aligning the attribute values of all records through global alignment to obtain a suboptimal solution;
and searching for repeated structures by means of attribute value similarity.
As an embodiment, the step of acquiring the image-text data based on visual characteristics through the network specifically includes:
labeling attribute values by domain-based data value labeling and mutually supplementing missing data across different data sources, so as to acquire the image-text data from deep web data.
In order to solve the above technical problem, the present application further provides a data information processing system based on visual characteristics, which is configured with a processor for executing program data to implement the data information processing method based on visual characteristics as described above.
The data information processing system is also provided with a data and service interface which comprises a data access interface, a data exchange interface, an identity authentication interface and a related system integration interface so as to integrate and interact data with a related system.
The application provides a data information processing method and system based on visual characteristics. The method comprises: collecting image-text data based on visual characteristics through a network; preprocessing the image-text data to obtain target data meeting a first requirement; performing comprehensive big data processing on the target data to obtain valid data meeting a second requirement; and establishing, updating and/or resetting an image-text database according to the valid data. In this way, a multi-level data cleaning process can be realized, the data can be made accurate, complete, consistent, unique, timely and valid, and problems such as missing, inconsistent and duplicated data can be effectively resolved. The image-text data are ultimately integrated into a comprehensive image-text database, which facilitates the integration, transformation and upgrading of the industry.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages can be more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic flow chart of an embodiment of a data information processing method based on visual characteristics according to the present application.
Fig. 2 is a schematic structural diagram of an embodiment of the data information processing system based on visual characteristics according to the present application.
FIG. 3 is a system configuration diagram of an embodiment of a data information processing system based on visual characteristics according to the present application.
Detailed Description
To further clarify the technical measures and effects taken by the present application to achieve the intended purpose, the present application will be described in detail below with reference to the accompanying drawings and preferred embodiments.
While the present application has been described in terms of specific embodiments and examples for achieving the desired objects and objectives, it is to be understood that the invention is not limited to the disclosed embodiments, but is to be accorded the widest scope consistent with the principles and novel features as defined by the appended claims.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a data information processing method based on visual characteristics according to the present application.
The data information processing method based on visual characteristics according to this embodiment may include, but is not limited to, the following steps.
Step S101, collecting image-text data based on visual characteristics through a network;
Step S102, carrying out data preprocessing on the image-text data to obtain target data meeting a first requirement;
Step S103, carrying out big data comprehensive processing on the target data to obtain valid data meeting a second requirement;
Step S104, establishing, updating and/or resetting an image-text database according to the valid data.
It is easy to understand that, in this embodiment, image and text information can be collected over the network, for example by an intelligent crawler, and the data can be cleaned, extracted, processed and classified. Sensitive content filtering, duplicate content filtering, keyword extraction and the like are also provided, so that a massive image-text database can be created and an existing database can be intelligently classified and managed.
It should be noted that, in this embodiment, the step of performing data preprocessing on the image-text data specifically includes performing data cleaning, extraction, processing and/or classification on the image-text data. Correspondingly, the step of performing big data comprehensive processing on the target data specifically includes performing sensitive content filtering, duplicate content filtering and/or keyword extraction on the target data.
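Steps S101 to S104 can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the application does not define the first and second requirements, so here the first requirement is assumed to be "non-empty text plus an image", and the second is assumed to be "not sensitive and not a duplicate". The names SENSITIVE_TERMS, STOPWORDS and all function names are hypothetical, and the network collection of step S101 is replaced by a hard-coded list.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
SENSITIVE_TERMS = {"forbidden"}
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def preprocess(raw_items):
    """S102: clean whitespace and keep items that have both text and an image
    (the assumed 'first requirement')."""
    target = []
    for item in raw_items:
        text = re.sub(r"\s+", " ", item.get("text", "")).strip()
        if text and item.get("image_url"):
            target.append({"url": item["url"], "text": text,
                           "image_url": item["image_url"]})
    return target


def extract_keywords(text, top_n=3):
    """Very simple keyword extraction: most frequent non-stopword tokens."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]


def comprehensive_processing(target):
    """S103: sensitive content filtering, duplicate filtering, keyword extraction
    (the assumed 'second requirement')."""
    seen_texts = set()
    valid = []
    for item in target:
        lowered = item["text"].lower()
        if any(term in lowered for term in SENSITIVE_TERMS):
            continue  # sensitive content filtering
        if lowered in seen_texts:
            continue  # duplicate content filtering (exact match after normalizing)
        seen_texts.add(lowered)
        valid.append({**item, "keywords": extract_keywords(item["text"])})
    return valid


def update_database(database, valid):
    """S104: establish or update the image-text database, keyed by URL."""
    for item in valid:
        database[item["url"]] = item
    return database


# S101 is replaced by a hard-coded sample instead of a real crawl.
raw = [
    {"url": "http://example.com/a", "text": "Port  news:  new port opens", "image_url": "a.jpg"},
    {"url": "http://example.com/b", "text": "port news: new port opens", "image_url": "b.jpg"},
    {"url": "http://example.com/c", "text": "forbidden topic", "image_url": "c.jpg"},
    {"url": "http://example.com/d", "text": "   ", "image_url": "d.jpg"},
]
database = update_database({}, comprehensive_processing(preprocess(raw)))
```

In this sample only the first item reaches the database: the second is a duplicate after normalization, the third is filtered as sensitive, and the fourth has no text.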
It is worth mentioning that the data information processing method based on visual characteristics according to this embodiment stores the image-text data, the target data and/or the valid data in an HDFS distributed file system based on the Hadoop distributed system architecture.
Referring specifically to fig. 3, in this embodiment, the step of storing the image-text data, the target data and/or the valid data in an HDFS distributed file system based on the Hadoop distributed system architecture may specifically include the following:
first, storing the set of URLs to be crawled in a to-be-crawled URL library, wherein the URL set is a text file recording the URLs to be crawled and serves as the crawler's entry point into the Internet;
second, storing the HTML of each crawled raw web page in a raw web page library, wherein the URL is stored as the key and the HTML of the web page corresponding to the URL is stored as the value;
third, storing the out-links obtained by parsing in an out-link URL library, wherein the URL is stored as the key and the set of out-links contained in the web page corresponding to the URL is stored as the value;
fourth, storing, in an XML library, the XML obtained by converting the crawled web pages, which comprises the image-text data, the target data and the valid data, wherein the URL is stored as the key and the XML of the web page corresponding to the URL is stored as the value.
Specifically, the to-be-crawled URL library of this embodiment stores the set of URLs to be crawled in the current layer. It is in fact a text file recording the URLs to be crawled, with the URLs separated by '\n'. Before the first layer is crawled, this text file is the set of seed URLs submitted by the user, which serves as the crawler's entry point into the Internet.
Specifically, the raw web page library of this embodiment stores the raw web pages crawled in each layer. A web page here is HTML that has not been processed in any way; it is stored with the URL as the key and the HTML of the web page corresponding to the URL as the value.
Specifically, the out-link URL library of this embodiment stores the out-links parsed in each layer, with the URL as the key and the set of out-links contained in the web page corresponding to the URL as the value.
Specifically, the XML library of this embodiment stores the converted XML of the web pages crawled in all layers. The conversion here amounts to preprocessing the HTML. The storage form uses the URL as the key and the XML of the web page corresponding to the URL as the value.
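The four libraries described above can be sketched as follows for a single crawl layer. This is a minimal illustration, not the storage layer of the application: plain Python dictionaries stand in for the key-value files that would live on HDFS, and `FAKE_WEB`, `fetch`, `to_xml` and `crawl_layer` are hypothetical names introduced here. The regular expressions are a deliberately naive substitute for a real HTML parser.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for the Internet, so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<html><title>Home</title><a href="http://example.com/news">news</a></html>',
    "http://example.com/news": '<html><title>News</title><a href="http://example.com/">home</a></html>',
}


def fetch(url):
    """Return the raw HTML for a URL (empty string if unknown)."""
    return FAKE_WEB.get(url, "")


def to_xml(url, html):
    """Convert raw HTML into a small XML record (the 'preprocessing' of the HTML)."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    title = match.group(1) if match else ""
    return f"<page url={quoteattr(url)}><title>{escape(title)}</title></page>"


def crawl_layer(url_library_text, fetch, to_xml):
    """Process one layer: read the newline-separated URL library and fill the stores."""
    raw_page_store = {}  # key: URL, value: unprocessed HTML
    outlink_store = {}   # key: URL, value: list of out-links found in that page
    xml_store = {}       # key: URL, value: converted XML
    for url in filter(None, url_library_text.split("\n")):
        html = fetch(url)
        raw_page_store[url] = html
        outlink_store[url] = re.findall(r'href="([^"]+)"', html)
        xml_store[url] = to_xml(url, html)
    # The next layer's URL library: out-links not fetched in this layer, '\n'-separated.
    next_urls = []
    for links in outlink_store.values():
        for link in links:
            if link not in raw_page_store and link not in next_urls:
                next_urls.append(link)
    return raw_page_store, outlink_store, xml_store, "\n".join(next_urls)


seeds = "http://example.com/\n"  # user-submitted seed URLs for the first layer
raw_pages, outlinks, xml_pages, next_library = crawl_layer(seeds, fetch, to_xml)
```

Feeding `next_library` back into `crawl_layer` would process the second layer; a real crawler would also keep a global record of visited URLs across layers.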
It should be particularly noted that, for deep web content acquisition, this embodiment may adopt a deep web data integration system, which may include two parts: a preprocessing part and an online processing part. The preprocessing part is responsible for collecting information about the relevant Web databases, and includes a module that analyzes the query interface of a data source and a module that describes the Web database. The online processing part handles user queries according to the information collected by the preprocessing part, and includes a user query/translation module, a data extraction and labeling module, a data sorting module and the like.
It should be noted that, in this embodiment, the step of acquiring the image-text data based on visual characteristics through the network specifically includes: analyzing the query interface of a data source by recursive element combination based on a maximum entropy statistical model, so as to acquire the image-text data from deep web data.
For example, the query interface of a data source may be analyzed by recursive element combination: an element is combined in some way with one or several adjacent elements to form a larger element, and the larger elements continue to be combined until the process ends. The key problem in recursive element combination is selecting the most suitable combination from all possible combinations. Therefore, this embodiment may adopt an automatic query interface understanding model based on a maximum entropy statistical model: when a query interface is analyzed, the maximum entropy statistical model is used to evaluate the probability of each element combination, then to calculate the total probability of the element combinations of the whole form, and the element combination with the highest probability is taken as the final element combination.
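The selection of the highest-probability element combination can be illustrated with the sketch below. It makes several assumptions that are not in the application: the trained maximum entropy model is replaced by a hand-written stub `group_probability` with invented values, form elements are reduced to (kind, name) pairs, and only one level of combination is shown (applying the same search again to the resulting groups would give the recursion). The total probability of a form is taken as the product of its group probabilities, computed in log space.

```python
import math


def group_probability(group):
    """Stub standing in for a trained maximum entropy model (hypothetical values)."""
    kinds = [kind for kind, _ in group]
    if kinds[0] == "label" and len(kinds) > 1 and all(k == "input" for k in kinds[1:]):
        return 0.9   # a label followed by its input field(s)
    if kinds == ["input"]:
        return 0.3   # an input with no label
    if kinds == ["label"]:
        return 0.2   # a label with no input
    return 0.05      # any other combination is considered unlikely


def best_combination(elements, max_group=3):
    """Split the element sequence into adjacent groups maximizing the total probability."""
    n = len(elements)
    # best[i] holds (log-probability, groups) for the first i elements.
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_group), end):
            group = elements[start:end]
            score = best[start][0] + math.log(group_probability(group))
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [group])
    log_prob, groups = best[n]
    return groups, math.exp(log_prob)


form = [
    ("label", "Title"), ("input", "title"),
    ("label", "Date"), ("input", "date_from"), ("input", "date_to"),
]
groups, probability = best_combination(form)
```

With these stub values the search groups "Title" with its single input and "Date" with its two inputs, which is the combination a reader of the form would expect.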
Specifically, the step of acquiring the image-text data based on visual characteristics through the network in this embodiment specifically includes: collecting Web database samples by progressive sampling based on a weighted attribute value graph model, so as to acquire the image-text data from deep web data.
For example, in the Web database sample collection of this embodiment, a Web database D may have n records {t1, …, tn} and m attributes A = {A1, …, Am}. Web database sample collection has two main requirements: the sample must be representative, and sampling must be efficient. Therefore, this embodiment employs a progressive sampling method based on a weighted attribute value graph model. The weighted attribute value graph of a database is an undirected graph G(V, E, W): each node v ∈ V is a value of some attribute; if two attribute values appear together in one or more records of the database, there is an edge e ∈ E between the two corresponding nodes, and each edge e has a weight w ∈ W representing the number of records in which the two nodes appear together. The degree of a node is the sum of the weights of all edges connected to it.
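The graph definition above can be made concrete with a short sketch. It only builds the weighted attribute value graph and the node degrees from a handful of records; how the progressive sampler uses the graph to choose its next query is not specified in the text and is not shown. The function name `build_attribute_value_graph` and the sample records are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_attribute_value_graph(records):
    """Return (edge weights, node degrees) of the weighted attribute value graph.

    A node is an (attribute, value) pair. An undirected edge links two nodes that
    occur in the same record; its weight counts the records where both occur.
    """
    weights = defaultdict(int)
    for record in records:
        # Sort the nodes so each undirected edge has one canonical (u, v) key.
        nodes = sorted((attr, val) for attr, val in record.items() if val is not None)
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += 1
    degree = defaultdict(int)
    for (u, v), w in weights.items():
        degree[u] += w  # degree = sum of the weights of all incident edges
        degree[v] += w
    return dict(weights), dict(degree)


# Hypothetical sample of a Web database with attributes "section" and "year".
records = [
    {"section": "sports", "year": "2019"},
    {"section": "sports", "year": "2019"},
    {"section": "sports", "year": "2018"},
    {"section": "finance", "year": "2019"},
]
weights, degree = build_attribute_value_graph(records)
```

Here the edge between ("section", "sports") and ("year", "2019") has weight 2 because the two values co-occur in two records, and the degree of ("section", "sports") is 3.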
Specifically, the step of acquiring the image-text data based on visual characteristics through the network in this embodiment specifically includes: performing data extraction and labeling with a record alignment model based on attribute value similarity and tags, so as to acquire the image-text data from deep web data, which specifically comprises the following steps:
first, aligning the attribute values in any two records by dynamic programming, so that the sum of the similarities of the aligned attribute values is maximal;
second, aligning the attribute values of all records through global alignment to obtain a suboptimal solution;
third, searching for repeated structures by means of attribute value similarity.
For example, to extract data units, data objects, sentiment words and object-sentiment collocations may first be extracted automatically from a small annotated evaluation corpus; the resulting unified collocations are then used to match and collect all candidate opinion units in the evaluation text under test. The matching process is as follows: for each unified collocation UC_i, the {attribute, sense} pair, the {pos_attribute, sense} pair and the {attribute, pos_sense} pair are matched in that order, and the candidate data unit is returned as soon as one match succeeds; otherwise, no candidate data is considered to exist for that collocation and the next unified collocation is tried. To speed up matching, this embodiment may first split the evaluation text into sentences and then process the sentences one by one.
It should be noted that the candidate data units in this embodiment may include a large number of texts that are not genuine data units, so these must be filtered out as far as possible to ensure high extraction accuracy. This embodiment may use a classification method to filter and clean the candidate data. For example, a Support Vector Machine (SVM), which performs well on binary classification, is selected, and the classification features include the data object words and their parts of speech, the sentiment words and their parts of speech, object-sentiment collocation matching features, and the conjunctions used to connect different data units related to the candidate data. This embodiment may treat the genuine data units in the annotated evaluation corpus as positive (+1) samples, and extract an equal number of sentences without data units from the annotated corpus as negative (-1) samples. After training is completed, the SVM classifier filters and cleans the candidate data.
It should be noted that the data extraction module of this embodiment performs the following four consecutive steps: identifying the query result section, segmenting records, aligning records and labeling records. In this way, the embodiment can accurately extract data from HTML.
Further, this embodiment may adopt a record alignment model based on attribute value similarity and tags. When aligning records, the model considers not only the tag of each attribute value but also makes full use of the similarity between attribute values. The whole process comprises the following three steps:
Step one, record alignment: the attribute values of every two different records are aligned. The basic idea is to align the attribute values in the two records by dynamic programming, so that the sum of the similarities of the aligned attribute values is maximal.
Step two, global alignment: the attribute values of all records are aligned. Given all pairwise record alignments, global alignment aligns the attribute values of all records so that the result is consistent with as many record alignments as possible. This alignment is an NP-hard problem, and this embodiment can find a suboptimal solution using a divide-and-conquer algorithm.
Step three, repeated structure processing: the repeated structures present in each record are found. Assuming that a record is generated by a template AB CD, B is a repeated structure since B may occur multiple times in a record. This embodiment can distinguish whether several consecutive attribute values with the same tag were generated by one repeated structure, or were generated by different structures and merely happen to share the same tag.
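Step one, the pairwise alignment, can be sketched with a standard dynamic program of the kind used for sequence alignment. The sketch below is an illustration under assumptions: attribute values are plain strings kept in page order, the similarity measure is Python's `difflib.SequenceMatcher` ratio (the application does not say which measure it uses), tags are ignored, and unaligned values carry no penalty. The name `align_records` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for the model's measure."""
    return SequenceMatcher(None, a, b).ratio()


def align_records(rec_a, rec_b, sim=similarity):
    """Align two records' attribute values, preserving order, so that the sum of
    the similarities of the aligned pairs is maximal.

    Returns (pairs, total) where pairs is a list of (index_in_a, index_in_b).
    """
    n, m = len(rec_a), len(rec_b)
    # score[i][j]: best total similarity aligning rec_a[:i] with rec_b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j],                                    # leave rec_a[i-1] unaligned
                score[i][j - 1],                                    # leave rec_b[j-1] unaligned
                score[i - 1][j - 1] + sim(rec_a[i - 1], rec_b[j - 1]),  # align the two values
            )
    # Trace back through the table to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return pairs, score[n][m]


record_a = ["Shenzhen Daily", "2019-10-23", "Port opens"]
record_b = ["Shenzhen Daily", "Port opens today"]
pairs, total = align_records(record_a, record_b)
```

In this example the date in `record_a` has no counterpart in `record_b` and is left unaligned, while the source name and the headline are paired. Global alignment (step two) would then have to reconcile such pairwise results across all records.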
Specifically, the step of acquiring the image-text data based on visual characteristics through the network in this embodiment specifically includes: labeling attribute values by domain-based data value labeling and mutually supplementing missing data across different data sources, so as to acquire the image-text data from deep web data.
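One way to picture the labeling and mutual supplementing of missing data is the sketch below. It assumes that each source uses its own field names, that a hand-written domain dictionary (`DOMAIN_LABELS`, hypothetical) maps them to shared attribute labels, and that records from different sources are matched on one key attribute. None of these details come from the application.

```python
# Hypothetical domain dictionary: source-specific field name -> domain attribute label.
DOMAIN_LABELS = {
    "headline": "title", "title": "title",
    "byline": "author", "author": "author",
    "published": "date", "date": "date",
}


def label_record(record):
    """Relabel a source record with domain attribute names, dropping unknown fields."""
    return {DOMAIN_LABELS[field]: value
            for field, value in record.items() if field in DOMAIN_LABELS}


def supplement_missing(sources, key_field="title"):
    """Merge labeled records from several sources, filling each missing or empty
    attribute with the first non-empty value found in another source."""
    merged = {}
    for source_records in sources:
        for raw in source_records:
            record = label_record(raw)
            target = merged.setdefault(record[key_field], {})
            for field, value in record.items():
                if value not in (None, "") and target.get(field) in (None, ""):
                    target[field] = value
    return merged


source_a = [{"headline": "Port opens", "byline": "", "published": "2019-10-23"}]
source_b = [{"title": "Port opens", "author": "Li", "date": None}]
merged = supplement_missing([source_a, source_b])
```

After merging, the record for "Port opens" carries the date from the first source and the author from the second.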
In addition, this embodiment may adopt data sorting that takes full account of the similarity between records: records are clustered according to their similarity and attribute importance, only the single best record in each cluster receives a higher ranking score, and the other similar records receive correspondingly lower ranking scores.
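The ranking idea can be sketched as follows, under assumptions that are not in the application: similarity is word-level Jaccard similarity, clustering is a greedy single pass against each cluster's first record, each record arrives with a base score standing in for attribute importance, and the non-best records of a cluster are demoted by a fixed factor. All names and thresholds are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_records(records, threshold=0.6, penalty=0.5):
    """Cluster similar records and keep the full score only for the best one per cluster.

    records: list of (text, base_score). Returns (order, scores), where order lists
    record indices from highest to lowest ranking score.
    """
    clusters = []
    for idx, (text, _) in enumerate(records):
        for cluster in clusters:
            # Compare with the cluster's first record (greedy, single pass).
            if jaccard(text, records[cluster[0]][0]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    scores = {}
    for cluster in clusters:
        best = max(cluster, key=lambda i: records[i][1])
        for i in cluster:
            base = records[i][1]
            scores[i] = base if i == best else base * penalty  # demote near-duplicates
    order = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
    return order, scores


records = [
    ("new port opens in shenzhen", 0.9),
    ("new port opens in shenzhen today", 0.8),
    ("press group launches database", 0.7),
]
order, scores = rank_records(records)
```

The second record is a near-duplicate of the first, so it is demoted below the unrelated third record even though its base score is higher.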
In addition, the extracted deep structure information of documents may be applied to massive cross-modal image-text information retrieval. That is, the massive cross-modal image-text documents are organized and managed according to the structure label of each module, and the document information carrying the relevant structure labels in the document library is then opened to users according to the browsing permissions granted to them, thereby achieving intelligent management of massive data.
In this way, a multi-level data cleaning process can be realized, the data can be made accurate, complete, consistent, unique, timely and valid, and problems such as missing, inconsistent and duplicated data can be effectively resolved. The image-text data are ultimately integrated into a comprehensive image-text database, which facilitates the integration, transformation and upgrading of the industry.
Referring to fig. 2 and fig. 3, the present application further provides a data information processing system based on visual characteristics, as an embodiment, the data information processing system is configured with a processor 21, and the processor 21 is configured to execute program data to implement the data information processing method based on visual characteristics as described above.
The data information processing system is also provided with a data and service interface which comprises a data access interface, a data exchange interface, an identity authentication interface and a related system integration interface so as to integrate and interact data with a related system.
Specifically, referring to fig. 3, the data information processing system based on visual characteristics mainly uses an intelligent web crawler to collect image and text information and to clean, extract, process and classify the data. It also provides sensitive content filtering, duplicate content filtering, keyword extraction and the like, so as to build a massive image-text database and to intelligently classify and manage an existing database.
For example, the logical architecture of the graphic database created, updated and/or reset in the present embodiment may be divided into services such as infrastructure, resource storage, key technology, application function, service invocation, and system interface from bottom to top. Meanwhile, the security and guarantee system and the data standard specification system (CNML) run through the whole system, and may specifically include the following points:
First, infrastructure: the image-text database of this embodiment runs on the same infrastructure as the media integration big data cloud platform of a certain newspaper group.
Second, data storage: in this embodiment, all resource data are stored and managed by combining a relational database with a full-text database. The service processing capability of the relational database and the mass storage capacity of the full-text database together provide data support for manuscript query and reading.
Third, the key technologies of the present embodiment include a metadata technology, a full-text retrieval technology, a content management technology, an application integration technology, an identity authentication and management technology (interface), and the like to provide underlying information resource processing support.
Fourth, system interface: the system of this embodiment is based on an open architecture and provides comprehensive data and service interfaces, such as a data access interface, a data exchange interface, an identity authentication interface, and interfaces for integration with other systems, so as to facilitate integration and data interaction with other systems. For example, unified authentication and single sign-on for each application system are realized by interfacing with the unified user management system; big data retrieval and analysis services are provided through a full-text retrieval interface and a text mining structure; and hotspot analysis results for a set period are obtained from the analysis module of the content propagation effect analysis system and integrated into the system for display.
Fifth, application function: the application function layer of this embodiment is the core of the content asset management system, which mainly provides comprehensive content asset management functions such as classification management, navigation management, full-text retrieval, authority management, and manuscript statistics.
Sixth, service invocation: in this embodiment, service invocation mainly provides access services such as manuscript browsing, manuscript selection, and manuscript retrieval for users such as editors and journalists of a certain newspaper group. In the future, a manuscript selection application interface can be provided for the content authoring and contribution services of each organization of the newspaper group, and the manuscript resources displayed in the content asset library can be pushed to each editing system for editing.
Seventh, the security support system and the data standard system run vertically through all layers: security settings are configured at every level from the environment layer to the user presentation layer to guarantee the secure operation of the system, and the data of each layer conform to unified data construction standards and specifications.
As mentioned above, the processor 21 in this embodiment is configured to collect the graphics and text data based on the visual characteristics through the network;
the processor 21 is configured to perform data preprocessing on the image-text data to obtain target data meeting a first requirement;
the processor 21 is configured to perform big data comprehensive processing on the target data to obtain valid data meeting a second requirement;
the processor 21 is configured to create, update and/or reset a teletext database based on the valid data.
It is easy to understand that, in this embodiment, image and text information can be collected over a network, for example by an intelligent crawler, and the data can be cleaned, extracted, processed and classified, with sensitive content filtering, repeated content filtering, keyword extraction and the like also provided. A massive image-text database can thereby be created, and existing databases can be classified and managed intelligently.
It should be noted that, in this embodiment, the data preprocessing that the processor 21 performs on the image-text data specifically includes data cleaning, extraction, processing and/or classification of the image-text data. Correspondingly, the big data comprehensive processing that the processor 21 performs on the target data specifically includes sensitive content filtering, overlapped content filtering and/or keyword extraction on the target data.
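The two processing stages can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the `SENSITIVE_TERMS` blocklist, the exact-match duplicate test, and the frequency-based keyword extraction are all hypothetical stand-ins and are not taken from the application.

```python
import re
from collections import Counter

# Hypothetical blocklist; the application does not specify one.
SENSITIVE_TERMS = {"forbidden"}


def preprocess(raw_docs):
    """Data preprocessing: strip markup, normalise whitespace, drop empty items."""
    cleaned = []
    for doc in raw_docs:
        text = re.sub(r"<[^>]+>", " ", doc)       # remove HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if text:
            cleaned.append(text)
    return cleaned


def comprehensive_process(target_docs, top_k=3):
    """Filter sensitive and repeated content, then extract simple keywords."""
    seen = set()
    valid = []
    for text in target_docs:
        words = re.findall(r"\w+", text.lower())
        if SENSITIVE_TERMS & set(words):
            continue  # sensitive content filtering
        fingerprint = " ".join(words)
        if fingerprint in seen:
            continue  # overlapped (repeated) content filtering
        seen.add(fingerprint)
        # Assumption: keywords are simply the most frequent words.
        keywords = [w for w, _ in Counter(words).most_common(top_k)]
        valid.append({"text": text, "keywords": keywords})
    return valid


raw = ["<p>Port  news  port</p>", "port news port", "<b>forbidden item</b>", "   "]
valid_data = comprehensive_process(preprocess(raw))
```

Here `preprocess` plays the role of the step that yields target data, and `comprehensive_process` the step that yields valid data; a real system would replace each stand-in with the corresponding production component.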
It should be noted that the processor 21 of this embodiment is configured to store the image-text data, the target data, and/or the valid data, specifically by using an HDFS distributed file system based on the Hadoop distributed system architecture.
Specifically referring to fig. 3, in the embodiment, the storing the image-text data, the target data, and/or the valid data by using the HDFS distributed file system based on the Hadoop distributed system architecture may specifically include the following several ways:
in a first mode, the processor 21 is configured to store, in a to-be-captured URL library, the set of URLs to be captured, where the URL set is a text file recording the URLs to be captured and serves as the entry through which the crawler enters the Internet;
in a second mode, the processor 21 is configured to store, in an original web page library, the HTML information of the captured original web pages, where the URL is stored as the key and the HTML information of the web page corresponding to the URL is stored as the value;
in a third mode, the processor 21 is configured to store, in an out-link URL library, the out-links obtained by parsing, where the URL is stored as the key and the set of out-links contained in the web page corresponding to the URL is stored as the value;
in a fourth mode, the processor 21 is configured to store, in an XML library, the XML information obtained by converting the captured web pages, which includes the image-text data, the target data, and the valid data, where the URL is stored as the key and the XML information of the web page corresponding to the URL is stored as the value.
Specifically, the to-be-captured URL library of this embodiment stores the set of URLs that need to be captured in the current layer. It is in fact a text file recording the URLs to be captured, with the URLs separated by '\n'. Before the first layer is crawled, this text file is the set of seed URLs submitted by the user, which serves as the crawler's entry into the Internet.
Specifically, the original web page library of this embodiment stores the original web pages captured in each layer. Each web page is stored as unprocessed HTML information, with the URL as the key and the HTML information of the web page corresponding to that URL as the value.
Specifically, the out-link URL library of this embodiment stores the out-links parsed in each layer, with the URL as the key and the set of out-links contained in the web page corresponding to that URL as the value.
Specifically, the XML library of this embodiment stores the converted XML information of the web pages captured in all layers. The conversion here corresponds to a preprocessing of the HTML information. The URL is the key, and the XML information of the web page corresponding to that URL is the value.
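The key/value layout of the four libraries can be sketched as follows. This is a simplified in-memory illustration, not the HDFS-backed storage of the embodiment; the class `CrawlStores`, its method names, and the example URLs are hypothetical.

```python
def load_seed_urls(text):
    """Parse the to-be-captured URL library: a text file with one URL per line."""
    return [line.strip() for line in text.split("\n") if line.strip()]


class CrawlStores:
    """In-memory stand-ins for the three key-value libraries (key = URL)."""

    def __init__(self):
        self.raw_pages = {}   # URL -> unprocessed HTML of the page
        self.out_links = {}   # URL -> set of out-links found on the page
        self.xml_pages = {}   # URL -> XML obtained by converting the HTML

    def save(self, url, html, links, xml):
        self.raw_pages[url] = html
        self.out_links[url] = set(links)
        self.xml_pages[url] = xml

    def next_layer(self, visited):
        """URLs for the next crawl layer: out-links that were not yet fetched."""
        found = set().union(*self.out_links.values())
        return sorted(found - set(visited))


seeds = load_seed_urls("http://example.com/a\nhttp://example.com/b\n")
stores = CrawlStores()
stores.save(
    seeds[0],
    "<html>...</html>",
    ["http://example.com/b", "http://example.com/c"],
    "<doc/>",
)
frontier = stores.next_layer(visited=[seeds[0]])
```

In this sketch the out-link library of one layer feeds the to-be-captured URL list of the next layer, which is one plausible reading of the layer-by-layer description above.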
It should be particularly noted that, for deep web content acquisition, this embodiment may adopt a deep web data integration scheme comprising two parts: a preprocessing part and an online processing part. The preprocessing part is responsible for collecting information about the relevant Web databases and includes a module for analyzing the query interface of a data source and a module for describing the Web database. The online processing part processes user queries according to the information collected by the preprocessing part and includes a user query/translation module, a data extraction and labeling module, a data sorting module, and the like.
It should be noted that, in this embodiment, the processor 21 is configured to acquire the graphics and text data based on the visual characteristics through a network, and specifically includes: the processor 21 is configured to analyze the data source query interface by using a maximum entropy statistical model and a recursive element combination method to collect the graphics and text data from the deep web data.
For example, the query interface of a data source may be analyzed in this embodiment by recursive element combination: an element is combined in some way with one or several adjacent elements to form a larger element, and the larger elements continue to be combined until no further combination is possible. The key problem in recursive element combination is selecting the most suitable combination from all possible combinations. Therefore, this embodiment may adopt an automatic query interface understanding model based on the maximum entropy statistical model: when a query interface is analyzed, the maximum entropy statistical model is used to evaluate the probability of each element combination and then to calculate the total probability of each combination of the elements of the whole form, and the combination with the highest probability is taken as the final element combination.
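As a rough sketch of the selection step, the Python code below scores every way of grouping adjacent form elements and keeps the most probable one. Two simplifications are assumed: the recursion is flattened to a single level of grouping, and `group_probability` is a hypothetical lookup table standing in for a trained maximum entropy model.

```python
from functools import lru_cache


def group_probability(group):
    """Hypothetical stand-in for the probability a trained maximum entropy
    model would assign to combining these adjacent elements."""
    kinds = tuple(kind for kind, _ in group)
    table = {
        ("label", "textbox"): 0.9,
        ("label", "select"): 0.8,
        ("label",): 0.3,
        ("textbox",): 0.2,
        ("select",): 0.2,
    }
    return table.get(kinds, 0.01)


def best_combination(elements):
    """Return (probability, groups) for the most probable grouping of
    adjacent elements, where the total is the product of group probabilities."""
    elements = tuple(elements)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(elements):
            return 1.0, ()
        best = (0.0, ())
        for end in range(start + 1, len(elements) + 1):
            group = elements[start:end]
            rest_prob, rest_groups = solve(end)
            prob = group_probability(group) * rest_prob
            if prob > best[0]:
                best = (prob, (group,) + rest_groups)
        return best

    return solve(0)


form = [
    ("label", "Title"), ("textbox", "q_title"),
    ("label", "Year"), ("select", "q_year"),
]
prob, groups = best_combination(form)
```

With this toy table, each label is grouped with the input control that follows it, because that grouping has the highest total probability among all candidate groupings.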
Specifically, the processor 21 in this embodiment is configured to acquire, through a network, image-text data based on visual features, and specifically includes: the processor 21 is configured to perform sample collection of a Web database in a progressive sampling manner based on a model of a weighted attribute value graph, so as to collect the image-text data from deep-layer network data.
For example, in the Web database sample collection of this embodiment, a Web database $D$ may have $n$ records $\{t_1, \ldots, t_n\}$ and $m$ attributes $A = \{A_1, \ldots, A_m\}$. Web database sample collection has two main requirements: the sample must be representative, and sample collection must be efficient. Therefore, this embodiment employs a progressive sampling method based on a weighted attribute value graph model. The weighted attribute value graph of a database is an undirected graph $G = (V, E, W)$: each node $v \in V$ is a value of some attribute; if two attribute values appear together in one or more records of the database, there is an edge $e \in E$ between the two corresponding nodes, and each edge $e$ has a weight $w \in W$ giving the number of records in which the two nodes appear together. The degree of a node is the sum of the weights of all edges connected to it.
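The graph definition can be made concrete with a short Python sketch. It only builds the weighted attribute value graph and the node degrees from a handful of records; the progressive sampling procedure itself is not described in enough detail here to sketch. The function name and the record format (one dictionary per record) are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_attribute_value_graph(records):
    """Build edge weights and node degrees of the weighted attribute value graph.

    Each node is an (attribute, value) pair; the weight of an edge is the
    number of records in which its two nodes appear together.
    """
    weights = defaultdict(int)
    for record in records:
        nodes = sorted(record.items())  # sorted so each edge has one canonical key
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += 1
    degree = defaultdict(int)
    for (u, v), w in weights.items():
        degree[u] += w  # degree = sum of weights of incident edges
        degree[v] += w
    return dict(weights), dict(degree)


records = [
    {"author": "Li", "year": "2019"},
    {"author": "Li", "year": "2018"},
    {"author": "Wang", "year": "2019"},
    {"author": "Li", "year": "2019"},
]
weights, degree = build_attribute_value_graph(records)
```

In this example the nodes `("author", "Li")` and `("year", "2019")` co-occur in two records, so their edge has weight 2, and each of them has degree 3.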
Specifically, the processor 21 in this embodiment is configured to acquire, through a network, image-text data based on visual features, and specifically includes: the processor 21 is configured to perform data extraction and labeling by using a record alignment model based on attribute value similarity and a mark, so as to collect the image-text data from deep-layer web data, where the process specifically includes the following steps:
in a first process, the processor 21 is configured to align the attribute values in any two records by using a dynamic programming method, so that the sum of the similarities of the aligned attribute values is maximized;
in a second process, the processor 21 is configured to align the attribute values of all records through global alignment to obtain a suboptimal solution;
in a third process, the processor 21 is configured to search for repeating structures based on the similarity of the attribute values.
For example, in order to extract data units, data objects, emotion words, and object-emotion collocations can first be extracted automatically from a small-scale annotated evaluation corpus; all candidate opinion units are then matched and collected in the test evaluation text by using unification collocations. The matching process is as follows: for each unification collocation UC_i, the {attribute, sense} pair, the {pos_attribute, sense} pair and the {attribute, pos_sense} pair are matched in order, and the candidate data unit is returned as soon as one of them matches; otherwise, no candidate data unit is considered to exist, and the next unification collocation is tried. To increase the matching speed, this embodiment may first split the evaluation text into sentences and then process the sentences one by one.
It should be noted that the candidate data units in this embodiment may include a large amount of non-candidate data text, which must be filtered out as far as possible to ensure high data extraction accuracy. This embodiment may use a classification method to filter and clean the candidate data. For example, a Support Vector Machine (SVM), which performs well on two-class classification, is selected, and the classification features include the data object words and their word classes, the emotion words and their word classes, object-emotion collocation matching features, conjunctions used to connect different data units, and other features related to the candidate data. This embodiment may treat the real data units in the annotated evaluation corpus as positive (+1) samples and extract an equal number of sentences without data units from the annotated corpus as negative (-1) samples. After training is finished, the SVM classifier is used to filter and clean the candidate data.
It should be noted that the data extraction module of this embodiment comprises four consecutive steps: identifying the query result section, segmenting records, aligning records, and labeling records. This embodiment can thereby accurately extract data from HTML.
Further, this embodiment may adopt a record alignment model based on attribute value similarity and tags. In the process of aligning records, the model not only takes into account the tag of each attribute value but also makes full use of the similarity between attribute values. The whole process comprises the following three steps:
Step one, record alignment: the attribute values of every two different records are aligned. The basic idea is to align the attribute values in the two records by using a dynamic programming method, so that the sum of the similarities of the aligned attribute values is maximized.
Step two, global alignment: the attribute values of all records are aligned. Given all pairwise record alignments, global alignment aligns the attribute values of all records so that the result agrees with as many pairwise record alignments as possible. This alignment problem is NP-hard, and this embodiment can find a suboptimal solution by using a divide-and-conquer algorithm.
Step three, repeating structure processing: the repeating structures present in each record are found. Assuming that a record is generated by a template AB CD, B is a repeating structure since B may occur multiple times in a record. This embodiment can distinguish whether several consecutive attribute values with the same tag were generated by one repeating structure or by different structures that merely happen to have the same tag.
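Step one can be illustrated with a small dynamic programming sketch in Python. It covers only the pairwise alignment, not global alignment or repeating structure detection. The similarity measure (`difflib.SequenceMatcher.ratio`) is an assumption chosen for illustration, since the text does not specify how attribute value similarity is computed, and tags are ignored.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed attribute value similarity in [0, 1]; not specified in the text."""
    return SequenceMatcher(None, a, b).ratio()


def align_records(rec_a, rec_b):
    """Align two records (lists of attribute values), preserving order, so that
    the summed similarity of the aligned values is maximal."""
    n, m = len(rec_a), len(rec_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j],      # leave rec_a[i-1] unaligned
                score[i][j - 1],      # leave rec_b[j-1] unaligned
                score[i - 1][j - 1] + similarity(rec_a[i - 1], rec_b[j - 1]),
            )
    # Trace back through the table to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return score[n][m], pairs[::-1]


total, pairs = align_records(
    ["Deep Web", "2019", "Shenzhen"],
    ["Deep Web", "Shenzhen"],
)
```

In this example the first record has an extra attribute value ("2019") that the second lacks; the alignment pairs the two identical values and leaves the extra one unaligned.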
Specifically, the processor 21 in this embodiment is configured to acquire, through a network, image-text data based on visual features, and specifically includes: the processor 21 is configured to perform attribute value labeling in a data value labeling manner based on a domain and perform mutual complementation of missing data on different data sources, so as to collect the image-text data from the deep-layer network data.
In addition, this embodiment may adopt data sorting that fully considers the degree of similarity among records: the records are clustered according to their similarity and attribute importance, so that only the single best record in each cluster receives a larger rankingscore, while the other similar records receive correspondingly smaller rankingscores.
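One way to picture this sorting is the following Python sketch. It is a simplified assumption rather than the method of the embodiment: records are plain strings, clustering is a greedy pass with a similarity threshold, and the `importance` function, `threshold`, and `penalty` values are hypothetical.

```python
from difflib import SequenceMatcher


def rank_records(records, importance, threshold=0.8, penalty=0.5):
    """Cluster similar records and demote all but the best one in each cluster."""
    clusters = []
    # Visit records from most to least important, so the first member of
    # each cluster is its best record.
    for rec in sorted(records, key=importance, reverse=True):
        for cluster in clusters:
            if SequenceMatcher(None, rec, cluster[0]).ratio() >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    scores = {}
    for cluster in clusters:
        for position, rec in enumerate(cluster):
            base = importance(rec)
            # Only the best record of a cluster keeps its full score.
            scores[rec] = base if position == 0 else base * penalty
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


base_scores = {
    "Port opens new terminal.": 0.9,
    "Port opens new terminal": 0.8,
    "Council approves budget": 0.7,
}
ranked = rank_records(list(base_scores), base_scores.get)
```

Here the two near-identical records fall into one cluster, so the weaker of them drops below the unrelated record in the final ranking.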
In addition, the extracted deep-level document structure information may be applied to large-scale cross-modal image-text retrieval. That is, the massive cross-modal image-text documents are organized and managed according to the structure label of each module, and the document information carrying the relevant structure labels is then opened from the document library according to the browsing permissions granted to each user, so that massive data can be managed intelligently.
Furthermore, the present application may also provide a computer readable storage medium storing program data for implementing the methods and functions as described/shown in fig. 1-3 and the embodiments thereof when being executed by a processor.
The present application can thus realize a multi-level data cleaning process, make the data accurate, complete, consistent, unique, timely and valid, and effectively solve problems such as missing, inconsistent and repeated data. The image-text data are thereby organically integrated into a comprehensive image-text database, which facilitates the integration, transformation and upgrading of the industry.
Although the present application has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application, and all changes, substitutions and alterations that fall within the spirit and scope of the application are to be understood as being included within the following description of the preferred embodiment.

Claims (10)

1. A data information processing method based on visual features is characterized by comprising the following steps:
acquiring image-text data based on visual characteristics through a network;
carrying out data preprocessing on the image-text data to obtain target data meeting a first requirement;
carrying out big data comprehensive processing on the target data to obtain effective data meeting a second requirement;
and establishing, updating and/or resetting the image-text database according to the effective data.
2. A visual-feature-based data-information processing method according to claim 1, characterized in that:
the step of performing data preprocessing on the image-text data specifically comprises:
performing data cleaning, extraction, processing and/or classification on the image-text data;
the step of performing big data comprehensive processing on the target data specifically comprises:
performing sensitive content filtering, overlapped content filtering and/or keyword extraction on the target data.
3. The visual feature-based data information processing method according to claim 1, wherein the visual feature-based data information processing method stores the graphic data, the target data and/or the valid data, in particular, by using an HDFS distributed file system based on a Hadoop distributed system architecture.
4. The visual feature-based data information processing method according to claim 3, wherein the step of storing the teletext data, the target data and/or the valid data, in particular by means of an HDFS distributed file system based on a Hadoop distributed system architecture, in particular comprises:
storing a set of URLs to be captured in a to-be-captured URL library, wherein the URL set is a text file recording the URLs to be captured and serves as an entry for a crawler to enter the Internet;
storing the HTML information of the captured original web pages in an original web page library, wherein the URL is stored as a key and the web page HTML information corresponding to the URL is stored as a value;
storing the out-links obtained by parsing in an out-link URL library, wherein the URL is stored as a key and the set of out-links contained in the web page corresponding to the URL is stored as a value;
and storing, in an XML library, the XML information obtained by converting the captured web pages, which comprises the image-text data, the target data and the effective data, wherein the URL is stored as a key and the XML information of the web page corresponding to the URL is stored as a value.
5. The visual feature-based data information processing method according to claim 1, wherein the step of acquiring the visual feature-based graphics and text data via a network specifically includes:
and analyzing a data source query interface by using a recursive element combination mode based on a maximum entropy statistical model to acquire the image-text data from the deep-layer network data.
6. The visual feature-based data information processing method according to claim 1, wherein the step of acquiring the visual feature-based graphics and text data via a network specifically includes:
and adopting a progressive sampling mode of a model based on a weighted attribute value graph to collect Web database samples so as to collect the image-text data from deep network data.
7. The visual feature-based data information processing method according to claim 1, wherein the step of acquiring the visual feature-based graphics and text data via a network specifically includes:
data extraction and labeling are carried out by adopting a record alignment model based on attribute value similarity and marks so as to acquire the image-text data from deep-layer network data, wherein the method comprises the following steps:
aligning the attribute values in any two records by adopting a dynamic programming method to ensure that the sum of the similarity of the aligned attribute values is maximum;
aligning the attribute values of all records through global alignment to obtain a suboptimal solution;
and searching the repeated structure in a mode of similar attribute values.
8. The visual feature-based data information processing method according to claim 1, wherein the step of acquiring the visual feature-based graphics and text data via a network specifically includes:
and adopting a data value labeling mode based on the field to label the attribute value and performing mutual supplement operation of missing data on different data sources so as to acquire the image-text data from deep network data.
9. A visual characteristics-based data information processing system, characterized in that the data information processing system is provided with a processor for executing program data to implement the visual characteristics-based data information processing method according to any one of claims 1 to 8.
10. A visual-feature-based data-information processing system as claimed in claim 9, wherein said data-information processing system is further configured with data-and-services interfaces including a data-access interface, a data-exchange interface, an identity-authentication interface, and a related-system-integration interface for integration and data interaction with related systems.
CN201911009498.7A 2019-10-23 2019-10-23 Data information processing method and system based on visual features Pending CN110765106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911009498.7A CN110765106A (en) 2019-10-23 2019-10-23 Data information processing method and system based on visual features

Publications (1)

Publication Number Publication Date
CN110765106A (en) 2020-02-07

Family

ID=69333099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911009498.7A Pending CN110765106A (en) 2019-10-23 2019-10-23 Data information processing method and system based on visual features

Country Status (1)

Country Link
CN (1) CN110765106A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324761A (en) * 2013-07-11 2013-09-25 广州市尊网商通资讯科技有限公司 Product database forming method based on Internet data and system
US20170168472A1 (en) * 2015-09-29 2017-06-15 Kabushiki Kaisha Toshiba Information processing apparatus or information communication terminal, and information processing method
CN109740038A (en) * 2019-01-02 2019-05-10 安徽芃睿科技有限公司 Network data distributed parallel computing environment and method
CN109783619A (en) * 2018-12-14 2019-05-21 广东创我科技发展有限公司 A kind of data filtering method for digging

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination