CN110851136A - Data acquisition method and device, electronic equipment and storage medium - Google Patents


Publication number
CN110851136A
Authority
CN
China
Prior art keywords
data
file
reading
read
markup language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910881318.8A
Other languages
Chinese (zh)
Inventor
唐志辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910881318.8A priority Critical patent/CN110851136A/en
Priority to PCT/CN2019/118979 priority patent/WO2021051624A1/en
Publication of CN110851136A publication Critical patent/CN110851136A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A data acquisition method, comprising: receiving a data acquisition request carrying a keyword; obtaining, according to the keyword, the extensible markup language (XML) file corresponding to the keyword from a cache, and obtaining a file to be read; parsing the XML file with a hypertext markup language (HTML) parser to obtain configuration information of each type of data in the XML file; obtaining a preset key from the configuration information, and obtaining a key-value pair according to the configuration information and the preset key; reading the key-value pair into a cache; and reading the key-value pair from the cache, determining a data reading format according to the key-value pair, and reading target data from the file to be read according to the data reading format. The invention also provides a data acquisition device, an electronic device and a storage medium. The invention can improve file reading efficiency and makes better use of system resources.

Description

Data acquisition method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of intelligent terminals, in particular to a data acquisition method and device, electronic equipment and a storage medium.
Background
At present, office work on a computer often requires processing various documents (such as contracts and resumes) in order to extract required data from them. The usual approach is to use hard-coded Java as the text parsing tool. Specifically, the hard-coded Java code file is stored at a local path; when a document needs to be parsed, the code file is loaded from that path and used to parse the document and obtain the data.
Although this approach can parse a document and obtain data, the code file usually occupies a large amount of memory and must be stored locally, and loading it from local storage to parse the document takes a long time. As a result, document reading efficiency is low and system resources are poorly utilized.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data acquisition method, apparatus, electronic device and storage medium, which can improve the efficiency of reading files as a whole and have a high utilization rate of system resources.
A first aspect of the present invention provides a data acquisition method, the method comprising:
receiving a data acquisition request carrying a keyword;
obtaining, according to the keyword, the extensible markup language file corresponding to the keyword from a cache, and obtaining a file to be read;
parsing the extensible markup language file with a hypertext markup language parser to obtain configuration information of each type of data in the extensible markup language file;
acquiring a preset key from the configuration information, and acquiring a key value pair according to the configuration information and the preset key;
reading the key-value pair into a cache;
and reading the key value pair from the cache, determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format.
In one possible implementation manner, parsing the extensible markup language file with the hypertext markup language parser to obtain configuration information of each type of data in the extensible markup language file includes:
for each type of data, determining a target label related to the data from the extensible markup language file according to the type of the data;
reading the target tag through a selector of the hypertext markup language parser to obtain configuration information of the data; or reading the target tag through a document object model access method of the hypertext markup language parser to obtain the configuration information of the data.
In a possible implementation manner, the obtaining, according to the configuration information and the preset key, a key-value pair includes:
storing the configuration information into a target object;
and forming a key value pair by the preset key and the target object.
In a possible implementation manner, the determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format includes:
determining a regular expression from the key value pair;
and using the regular expression to obtain target data matched with the regular expression from the data stored in the file to be read.
In a possible implementation manner, there are a plurality of regular expressions, and the obtaining, by using the regular expressions, target data matched with the regular expressions from data stored in the file to be read includes:
sequentially judging whether target data matched with the regular expressions exist in all data stored in the file to be read according to a preset arrangement sequence of the regular expressions;
and if target data matched with the regular expression exists in all the data stored in the file to be read, acquiring the target data matched with the regular expression.
In a possible implementation manner, after acquiring, according to the keyword, an xml file corresponding to the keyword from a cache and acquiring a file to be read, the method further includes:
analyzing the file to be read by using a text analysis tool to obtain an input stream;
saving the input stream to a cache;
the reading of the target data from the file to be read according to the data reading format comprises:
and reading target data in the input stream from the cache according to the data reading format.
In one possible implementation, the method further includes:
acquiring a parameter type and a parameter name from the key value pair;
and storing the target data according to the data storage format of the parameter type and the parameter name.
A second aspect of the present invention provides a data acquisition apparatus, the apparatus comprising:
the receiving module is used for receiving a data acquisition request carrying a keyword;
the first acquisition module is used for acquiring an extensible markup language file corresponding to the keyword from a cache according to the keyword and acquiring a file to be read;
the analysis module is used for analyzing the extensible markup language file by using a hypertext markup language analyzer to obtain the configuration information of each type of data in the extensible markup language file;
the second acquisition module is used for acquiring a preset key from the configuration information and acquiring a key value pair according to the configuration information and the preset key;
the first reading module is used for reading the key-value pair into a cache;
and the second reading module is used for reading the key value pair from the cache, determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the data acquisition method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data acquisition method.
With the above technical solution, a data acquisition request carrying a keyword and a file to be read can be received; the extensible markup language (XML) file corresponding to the keyword is obtained from a cache according to the keyword; the XML file is parsed with a hypertext markup language (HTML) parser to obtain configuration information of each type of data in the XML file; a preset key is obtained from the configuration information, and a key-value pair is obtained according to the configuration information and the preset key; the key-value pair is read into a cache; and a data reading format is determined according to the key-value pair, and target data is read from the file to be read according to the data reading format. It can be seen that, in the present invention, XML files can be stored in a cache in advance. When target data in a file to be read is needed, the cached XML file is loaded and parsed with the HTML parser to obtain the configuration information and the key-value pairs, the key-value pairs are read into the cache, and the target data is read from the file to be read according to the key-value pairs.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a preferred embodiment of a data acquisition method disclosed in the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of a data acquisition device according to the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device implementing a data acquisition method according to a preferred embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The data acquisition method of the embodiment of the invention is applied to an electronic device. It can also be applied to a hardware environment formed by the electronic device and a server connected to the electronic device through a network, in which case the method is executed jointly by the server and the electronic device. Networks include, but are not limited to: a wide area network, a metropolitan area network, or a local area network.
The electronic device includes an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and hardware thereof includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of hosts or network servers. The user device includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), or the like.
FIG. 1 is a flow chart of a preferred embodiment of a data acquisition method disclosed in the present invention. The order of the steps in the flowchart may be changed, and some steps may be omitted.
S11, the electronic equipment receives the data acquisition request carrying the keyword.
In the embodiment of the invention, different types of data stored in the file to be read are to be acquired, and the rules for acquiring the data are stored in extensible markup language files. Because there are multiple extensible markup language files, a data acquisition request carrying a keyword must be received in order to determine which extensible markup language file to use.
And S12, the electronic equipment acquires the extensible markup language file corresponding to the keyword from the cache according to the keyword, and acquires the file to be read.
In the embodiment of the invention, because keywords and extensible markup language files correspond one-to-one, the corresponding extensible markup language file can be determined according to the keyword and obtained from the cache.
The file to be read may be a file pre-stored in the server or a file temporarily uploaded, and the file to be read may be determined and acquired by the keyword.
The files to be read may be contract files, client information files, resumes or other files to be read, the data stored in different files to be read are different, the files to be read of different types have corresponding extensible markup language files, and the different extensible markup language files have data acquisition rules corresponding to the files to be read of different types, that is, configuration information of the data. The format of the file to be read can be word or PDF.
Extensible Markup Language (XML) is a markup language used to transmit and store data. It has a uniform standard syntax, and XML documents are supported by almost all systems and products; because of this uniform format and syntax, XML can be used across platforms. Here, markup means information symbols that a computer can understand, through which documents containing various kinds of information can be processed and exchanged between computers.
Various extensible markup language files can be stored in a cache in advance, system resources are fully utilized, and the utilization rate of the system resources is improved.
In addition, obtaining a file from the cache is faster than obtaining it from local storage, and an extensible markup language file occupies little memory, so the extensible markup language file corresponding to the keyword can be obtained from the cache quickly, which improves file reading efficiency.
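The lookup in S12 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: it assumes the cache is an in-memory dictionary, and the names `XML_CACHE` and `get_rule_file`, the sample keywords and the XML content are all hypothetical.

```python
# Hypothetical in-memory cache: one XML rule file per keyword (one-to-one mapping).
XML_CACHE = {
    "contract": "<rules><data type='party_name'/></rules>",
    "resume": "<rules><data type='birth_date'/></rules>",
}


def get_rule_file(keyword: str) -> str:
    """Return the cached XML rule file text for a keyword, or raise if none exists."""
    try:
        return XML_CACHE[keyword]
    except KeyError:
        raise KeyError(f"no XML rule file cached for keyword {keyword!r}") from None


# The keyword carried by the data acquisition request selects the rule file.
xml_text = get_rule_file("resume")
```

Because the mapping is one-to-one, a plain dictionary lookup is enough here; a shared cache service could stand behind the same function signature.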
And S13, the electronic equipment analyzes the extensible markup language file by using a hypertext markup language analyzer to obtain the configuration information of each type of data in the extensible markup language file.
The extensible markup language file stores configuration information of various types of data.
In the embodiment of the present invention, the file to be read stores multiple types of data. For example, the data stored in a client information file includes, but is not limited to, birth date, identity card number, name, bank account, education information, occupation, income and property information. Because different functional modules of the system may need different types of data, the acquisition rules for the different types of data, namely their configuration information, must be obtained by parsing the extensible markup language file. The hypertext markup language parser is jsoup, a Java HTML (HyperText Markup Language) parser that can directly parse a URL (Uniform Resource Locator) address or HTML text content. It provides a convenient API (Application Programming Interface) for extracting and manipulating data through DOM (Document Object Model) traversal, CSS (Cascading Style Sheets) selectors and jQuery-like methods.
Because the extensible markup language file is parsed with jsoup, the file does not need to retain unused tags, which reduces the amount of code to write, makes the code more flexible and improves development efficiency.
Each type of data in the xml file has different configuration information, which may include but is not limited to: the type of data, the regular expression, the preset key, the parameter type and the parameter name. Different types of data stored in the file to be read can be acquired through the configuration information.
Specifically, parsing the extensible markup language file with the hypertext markup language parser to obtain configuration information of each type of data in the extensible markup language file includes:
for each type of data, determining a target tag related to the data from the extensible markup language file according to the type of the data;
reading the target tag through a selector of the hypertext markup language parser to obtain configuration information of the data; or reading the target tag through a document object model access method of the hypertext markup language parser to obtain the configuration information of the data.
In this optional embodiment, because there are multiple types of data and each type has different configuration information, for each type of data the tag related to its configuration information is located in the extensible markup language file according to the data type and taken as the target tag. The target tag can then be read through a selector (Selector) of the hypertext markup language parser jsoup, or through a document object model access method of jsoup, to obtain the configuration information of the data.
Specifically, reading the target tag through the jsoup selector means that jsoup supports a selector syntax similar to CSS or jQuery for finding matching tags; the target tag can be found through the selector, which returns an element list, that is, the configuration information of the data.
Specifically, reading the target tag through the document object model access method means that jsoup can access the DOM: a matching tag object can be obtained by tag name, id, class name and the like, yielding the configuration information of the data.
The target tag is a tag where the configuration information of the data is located, the tag is an XML tag, the XML tag is user-defined, each XML tag is provided with a corresponding closing tag, and content can be stored between the XML tag and the corresponding closing tag. Each XML tag, corresponding close tag, and deposited content may constitute an Element (Element).
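The parsing step can be sketched as follows. The text names jsoup, a Java library; this sketch deliberately swaps in Python's standard-library `xml.etree.ElementTree` so that it runs without extra dependencies. The XML layout (`rules`, `data`, `key`, `regex`, `paramType`, `paramName`) and the function name `parse_config` are assumptions made for illustration; the source does not specify a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: one <data> element per data type.
RULES_XML = r"""
<rules>
  <data type="birth_date">
    <key>birthDate</key>
    <regex>\d{4}-\d{1,2}-\d{1,2}</regex>
    <regex>\d{4}\.\d{1,2}\.\d{1,2}</regex>
    <paramType>list</paramType>
    <paramName>birthDates</paramName>
  </data>
</rules>
"""


def parse_config(xml_text: str) -> dict:
    """Return {data type: configuration info} for every <data> element."""
    root = ET.fromstring(xml_text.strip())
    configs = {}
    for element in root.findall("data"):  # the target tag for each data type
        configs[element.get("type")] = {
            "key": element.findtext("key"),
            "regexes": [r.text for r in element.findall("regex")],
            "param_type": element.findtext("paramType"),
            "param_name": element.findtext("paramName"),
        }
    return configs


configs = parse_config(RULES_XML)
```

Looking elements up by tag name, as `findall("data")` does here, corresponds loosely to the document object model access route described above; a selector-based lookup would express the same query as a selector string.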
S14, the electronic device obtains a preset key from the configuration information and obtains a key value pair according to the configuration information and the preset key.
In the embodiment of the present invention, the configuration information of each type of data needs to be further processed for subsequent use in acquiring the data.
The data has multiple types, and each type of data has corresponding configuration information, that is, each type of data has a corresponding key-value pair.
Specifically, the obtaining a key-value pair according to the configuration information and the preset key includes:
storing the configuration information into a target object;
and forming a key value pair by the preset key and the target object.
The key (key) refers to a key contained in a Map object. A Map is an object that maps keys to values: each key is mapped to one value, forming a key-value pair. Given a key and a value, the value is stored in the Map object and can afterwards be accessed through the key.
The preset key is a key set in advance; keys corresponding to different types of data can be preset.
The cache (Cache) is a buffer for data exchange. When hardware needs to read data, it first looks for the required data in the cache; if the data is found, it is used directly, and if not, it is fetched from memory. Because the cache is much faster than memory, it is used to read frequently used data quickly.
Wherein the configuration information of the data comprises a preset key.
In this optional implementation, the preset key of each type of data may be obtained from the configuration information of each type of data, and before the configuration information of each type of data is read into the cache, the configuration information needs to be stored in an array object or other objects, so that a key value pair is conveniently formed with the preset key.
And S15, the electronic equipment reads the key-value pair into a cache.
After the key value pairs are formed and read into the cache, the configuration information of the data of the corresponding type can be quickly found through the preset keys, and meanwhile, system resources are fully utilized, so that the utilization rate of the system resources is high.
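Steps S14 and S15 can be sketched together as follows. This is a minimal illustration under stated assumptions: the target object is modeled as a dataclass, the cache as a dictionary, and the names `DataConfig`, `KV_CACHE` and `cache_config`, as well as the field names of the raw configuration, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataConfig:
    """Hypothetical target object holding one data type's configuration."""
    data_type: str
    regexes: list = field(default_factory=list)
    param_type: str = "string"
    param_name: str = ""


# Hypothetical cache of key-value pairs: preset key -> target object.
KV_CACHE: dict = {}


def cache_config(raw: dict) -> str:
    """Form the key-value pair from raw configuration info and read it into the cache."""
    preset_key = raw["key"]  # the preset key is itself part of the configuration
    target = DataConfig(raw["type"], raw["regexes"], raw["param_type"], raw["param_name"])
    KV_CACHE[preset_key] = target
    return preset_key


key = cache_config({
    "key": "birthDate",
    "type": "birth_date",
    "regexes": [r"\d{4}-\d{1,2}-\d{1,2}"],
    "param_type": "list",
    "param_name": "birthDates",
})
```

Later steps then need only the preset key, for example `KV_CACHE["birthDate"].regexes`, to recover the configuration of that data type.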
S16, the electronic device reads the key value pair from the cache, determines a data reading format according to the key value pair, and reads target data from the file to be read according to the data reading format.
In the embodiment of the present invention, because the file to be read stores multiple types of data, how to acquire each type of data in the file to be read needs to be determined according to configuration information of each type of data, and the configuration information is processed and stored in the key value pair, so that each type of data stored in the file to be read can be acquired according to the key value pair of each type of data, thereby avoiding frequent access to a database and reducing the load on the database.
Specifically, the determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format includes:
determining a regular expression from the key value pair;
and using the regular expression to obtain target data matched with the regular expression from the data stored in the file to be read.
The configuration information of each type of data includes a regular expression (english: RegularExpression, often abbreviated as regex, regexp, or RE in code), and the regular expression is a logical formula for operating on a character string, that is, a "regular character string" is formed by using specific characters defined in advance and a combination of the specific characters, and the "regular character string" is used to express a filtering logic for the character string.
In this optional embodiment, a regular expression can be obtained from the configuration information of each type of data, and all the data stored in the file to be read can then be matched against the regular expression to obtain the data that matches.
Specifically, the obtaining, by using the regular expression, target data matched with the regular expression from data stored in the file to be read includes:
sequentially judging whether target data matched with the regular expressions exist in all data stored in the file to be read according to a preset arrangement sequence of the regular expressions;
and if target data matched with the regular expression exists in all the data stored in the file to be read, acquiring the target data matched with the regular expression.
In this optional embodiment, the same type of data may be represented differently in different files to be read. For example, if a client's birth date is January 1, 1990, the possible representations include 1990.1.1, 1990-1-1 and the like. Multiple different regular expressions are therefore needed to match the different representations. Regular expressions that match the more common representations can be placed first, which saves overall matching time; that is, the arrangement order of the regular expressions can be preset. When data matching a regular expression is found among all the data stored in the file to be read, that data is obtained.
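The ordered matching can be sketched as follows. The two patterns correspond to the two birth-date representations mentioned above; the names `BIRTH_DATE_PATTERNS` and `find_target_data` are hypothetical. Stopping at the first pattern that produces a match is an assumption of this sketch, since the text only states that the expressions are tried in their preset order.

```python
import re

# Hypothetical patterns, ordered from the more common to the less common representation.
BIRTH_DATE_PATTERNS = [
    r"\d{4}-\d{1,2}-\d{1,2}",
    r"\d{4}\.\d{1,2}\.\d{1,2}",
]


def find_target_data(text: str, patterns: list) -> list:
    """Try each pattern in its preset order; return the matches of the first one that hits."""
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            return matches
    return []


print(find_target_data("Name: Zhang San, born 1990.1.1", BIRTH_DATE_PATTERNS))
# prints ['1990.1.1']
```

Placing the most common representation first means that most files are resolved by the first pattern, so the remaining patterns are rarely evaluated.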
As an optional implementation manner, after acquiring, according to the keyword, an xml file corresponding to the keyword from a cache and acquiring a file to be read, the method further includes:
analyzing the file to be read by using a text analysis tool to obtain an input stream;
saving the input stream to a cache;
the reading of the target data from the file to be read according to the data reading format comprises:
and reading target data in the input stream from the cache according to the data reading format.
Wherein, the input stream refers to an object capable of reading byte sequence, and a stream can be understood as a sequence of data; an input stream represents reading data from one source.
In this optional embodiment, the text parsing tool Apache Tika can be used to read the file to be read and convert the data stored in it into an input stream. The input stream can be stored in a cache, that is, the data stored in the file to be read can be held in the cache, so that the data is read faster and system performance is improved.
Apache Tika is a Java-based toolkit for content detection and analysis that can detect and extract content from different file types (such as PPT, XLS and PDF).
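The caching of the input stream can be sketched as follows. This sketch does not call Apache Tika: the extraction step is replaced by a stand-in that assumes the content is already UTF-8 text, and the names `STREAM_CACHE`, `cache_input_stream` and `read_from_cache` are hypothetical.

```python
import io
import re

# Hypothetical cache: file identifier -> extracted content as bytes.
STREAM_CACHE: dict = {}


def cache_input_stream(file_id: str, content: bytes) -> None:
    """Store the extracted content once so that later reads skip the file itself."""
    # A real text-parsing tool would extract this content from a Word or PDF file.
    STREAM_CACHE[file_id] = content


def read_from_cache(file_id: str, pattern: str) -> list:
    """Open a fresh input stream over the cached bytes and match the pattern against it."""
    stream = io.BytesIO(STREAM_CACHE[file_id])
    text = stream.read().decode("utf-8")
    return re.findall(pattern, text)


cache_input_stream("resume-1", b"Name: Zhang San, born 1990-1-1")
dates = read_from_cache("resume-1", r"\d{4}-\d{1,2}-\d{1,2}")
```

Caching the bytes rather than a stream object lets every read start from the beginning of the content.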
As an optional implementation, the method further comprises:
acquiring a parameter type and a parameter name from the key value pair;
and storing the target data according to the data storage format of the parameter type and the parameter name.
The configuration information of the data in the key value pair comprises a parameter type and a parameter name.
The parameters include arrays, lists and other types that can hold data.
In this optional embodiment, each type of data corresponds to a different parameter type and/or parameter name, the parameter type and parameter name are determined from the configuration information of each type of data, and the data of each type is stored in a parameter, that is, different types of data are stored in different parameters, which facilitates a functional module or a method of the system to call different types of data.
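Storing the target data by parameter type and parameter name can be sketched as follows. The two supported parameter types (`"list"` and `"string"`), the function name `store_target_data` and the sample values are assumptions made for illustration.

```python
def store_target_data(matches: list, param_type: str, param_name: str, params: dict) -> dict:
    """Store matches under param_name, shaped according to param_type."""
    if param_type == "list":
        params[param_name] = list(matches)
    elif param_type == "string":
        # Keep only the first match when a single value is expected.
        params[param_name] = matches[0] if matches else ""
    else:
        raise ValueError(f"unsupported parameter type: {param_type}")
    return params


params = store_target_data(["1990-1-1"], "string", "birthDate", {})
params = store_target_data(["Bachelor", "Master"], "list", "education", params)
```

A functional module that needs education records can then read `params["education"]` without knowing how the data was matched.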
In the method flow described in FIG. 1, a data acquisition request carrying a keyword and a file to be read can be received; the extensible markup language (XML) file corresponding to the keyword is obtained from a cache according to the keyword; the XML file is parsed with a hypertext markup language (HTML) parser to obtain configuration information of each type of data in the XML file; a preset key is obtained from the configuration information, and a key-value pair is obtained according to the configuration information and the preset key; the key-value pair is read into a cache; and a data reading format is determined according to the key-value pair, and target data is read from the file to be read according to the data reading format. It can be seen that, in the present invention, XML files can be stored in a cache in advance. When target data in a file to be read is needed, the cached XML file is loaded and parsed with the HTML parser to obtain the configuration information and the key-value pairs, the key-value pairs are read into the cache, and the target data is read from the file to be read according to the key-value pairs.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and it will be apparent to those skilled in the art that modifications may be made without departing from the inventive concept of the present invention, and these modifications are within the scope of the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of a data acquisition device according to the present disclosure.
In some embodiments, the data acquisition device operates in an electronic device. The data acquisition means may comprise a plurality of functional modules consisting of program code segments. Program code for various program segments in the data acquisition device may be stored in the memory and executed by the at least one processor to perform some or all of the steps in the data acquisition method described in fig. 1.
In this embodiment, the data acquisition device may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 201, a first obtaining module 202, a parsing module 203, a second obtaining module 204, a first reading module 205 and a second reading module 206. A module as referred to herein is a series of computer program segments that is stored in the memory, can be executed by at least one processor and performs a fixed function. The functions of the respective modules are described in detail in the following embodiments.
The receiving module 201 is configured to receive a data obtaining request carrying a keyword.
In the embodiment of the invention, different types of data stored in the file to be read are acquired. The rules for acquiring the data are stored in extensible markup language files, and there are a plurality of such files, so a data acquisition request carrying a keyword needs to be received in order to determine which extensible markup language file to use.
The first obtaining module 202 is configured to obtain, according to the keyword, an extensible markup language file corresponding to the keyword from a cache, and obtain a file to be read.
In the embodiment of the invention, because different keywords correspond to different extensible markup language files one by one, the corresponding extensible markup language file can be determined according to the keywords, and the extensible markup language file corresponding to the keywords can be obtained from the cache.
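The one-to-one lookup described above can be sketched as follows. This is a minimal illustration only: an in-memory dictionary stands in for the cache, and the keywords, the function name `get_xml_for_keyword`, and the XML snippets are assumptions introduced for the example, not details from the disclosure.

```python
# Hypothetical in-memory cache: keyword -> content of the XML rule file.
# The keywords and XML snippets below are invented for illustration.
XML_CACHE = {
    "contract": "<rules><data type='contract_no'/></rules>",
    "client_info": "<rules><data type='id_number'/></rules>",
}


def get_xml_for_keyword(keyword: str) -> str:
    """Return the XML rule file that corresponds to the given keyword."""
    try:
        return XML_CACHE[keyword]
    except KeyError:
        # No rule file was cached in advance for this keyword.
        raise LookupError(f"no XML rule file cached for keyword {keyword!r}") from None


xml_text = get_xml_for_keyword("client_info")
```

Because each keyword maps to exactly one rule file, a plain dictionary lookup is enough here; a real deployment might use a shared cache service instead.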
The file to be read may be a file pre-stored in the server or a file temporarily uploaded, and the file to be read may be determined and acquired by the keyword.
The file to be read may be a contract file, a client information file, a resume, or another file. Different files to be read store different data. Each type of file to be read has a corresponding extensible markup language file, and each extensible markup language file holds the data acquisition rules for that type of file, that is, the configuration information of the data. The file to be read may be in Word or PDF format.
Extensible Markup Language (XML) is a markup language used for transmitting and storing data. It has a uniform standard syntax, and XML documents are supported by almost all systems and products; because of this uniform format and syntax, XML can be used across platforms. Here, a mark is an information symbol that computers can understand; through such marks, documents containing various kinds of information can be processed and exchanged between computers.
Various extensible markup language files can be stored in a cache in advance, system resources are fully utilized, and the utilization rate of the system resources is improved.
In addition, acquiring a file from the cache is faster than acquiring it from local storage, and an extensible markup language file occupies little memory, so the extensible markup language file corresponding to the keyword can be acquired from the cache quickly, which improves file reading efficiency.
The parsing module 203 is configured to parse the extensible markup language file by using a hypertext markup language parser to obtain configuration information of each type of data in the extensible markup language file.
The extensible markup language file stores configuration information of various types of data.
In the embodiment of the present invention, the file to be read stores multiple types of data. For example, the data stored in a client information file includes, but is not limited to, birth date, identification card number, name, bank account, education information, occupation, income, and property information. Because the types of data are varied and different functional modules of the system may need different types of data, the acquisition rules of the different types of data, that is, their configuration information, need to be obtained by parsing the extensible markup language file. The HTML (HyperText Markup Language) parser is jsoup, a Java HTML parser that can directly parse a URL (Uniform Resource Locator) address or HTML text content. It provides a convenient API (Application Programming Interface) for fetching and manipulating data through DOM (Document Object Model) methods, CSS (Cascading Style Sheets) selectors, and jQuery-like methods.
Because the extensible markup language file is parsed using jsoup, the file does not need to reserve tags that are not used, which reduces the amount of code to be written, makes the code more flexible, and improves development efficiency.
Each type of data in the xml file has different configuration information, which may include but is not limited to: the type of data, the regular expression, the preset key, the parameter type and the parameter name. Different types of data stored in the file to be read can be acquired through the configuration information.
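A rule file carrying the five configuration items listed above might be read as in the following sketch. The tag and attribute names (`data`, `regex`, `key`, `paramType`, `paramName`) are hypothetical, since the disclosure does not give a schema, and Python's standard-library `xml.etree.ElementTree` is used purely as a stand-in for the Java HTML parser described in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file; the element and attribute names are assumptions.
RULES_XML = r"""
<rules>
  <data type="id_number" key="idNumber" paramType="string" paramName="id_number">
    <regex>\d{17}[\dXx]</regex>
  </data>
  <data type="birth_date" key="birthDate" paramType="date" paramName="birth_date">
    <regex>\d{4}-\d{2}-\d{2}</regex>
  </data>
</rules>
"""


def parse_config(xml_text):
    """Collect the configuration information of each type of data."""
    root = ET.fromstring(xml_text.strip())
    configs = []
    for node in root.iter("data"):
        configs.append({
            "type": node.get("type"),
            "regex": (node.findtext("regex") or "").strip(),
            "preset_key": node.get("key"),
            "param_type": node.get("paramType"),
            "param_name": node.get("paramName"),
        })
    return configs


configs = parse_config(RULES_XML)
```

Each dictionary in `configs` holds the configuration information for one type of data.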
The second obtaining module 204 is configured to obtain a preset key from the configuration information, and obtain a key-value pair according to the configuration information and the preset key.
In the embodiment of the present invention, the configuration information of each type of data needs to be further processed for subsequent use in acquiring the data.
The data has multiple types, and each type of data has corresponding configuration information, that is, each type of data has a corresponding key-value pair.
The first reading module 205 is configured to read the key-value pair into a cache.
After the key value pairs are formed and read into the cache, the configuration information of the data of the corresponding type can be quickly found through the preset keys, and meanwhile, system resources are fully utilized, so that the utilization rate of the system resources is high.
The second reading module 206 is configured to read the key value pair from the cache, determine a data reading format according to the key value pair, and read target data from the file to be read according to the data reading format.
In the embodiment of the present invention, because the file to be read stores multiple types of data, how to acquire each type of data in the file to be read needs to be determined according to configuration information of each type of data, and the configuration information is processed and stored in the key value pair, so that each type of data stored in the file to be read can be acquired according to the key value pair of each type of data, thereby avoiding frequent access to a database and reducing the load on the database.
The parsing module 203 parses the extensible markup language file using the hypertext markup language parser and obtains the configuration information of each type of data in the extensible markup language file specifically as follows:
for each type of data, determining a target label related to the data from the extensible markup language file according to the type of the data;
reading the target tag through a selector of the hypertext markup language parser to obtain configuration information of the data; or reading the target tag through a document object model access method of the hypertext markup language parser to obtain the configuration information of the data.
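The two alternatives above, selector-based access and document-object-model access, can be contrasted in a short sketch. In jsoup these would correspond to methods such as `select` and `getElementsByTag`; the code below instead uses Python's standard-library `xml.etree.ElementTree` as a stand-in, and the rule file layout and function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file; the layout is an assumption for illustration.
RULES_XML = r"""
<rules>
  <data type="name"><regex>Name:\s*(\S+)</regex></data>
  <data type="birth_date"><regex>\d{4}-\d{2}-\d{2}</regex></data>
</rules>
"""

root = ET.fromstring(RULES_XML.strip())


def find_by_selector(root, data_type):
    """Selector-style access: one path expression picks out the target tag."""
    return root.find(f"./data[@type='{data_type}']")


def find_by_traversal(root, data_type):
    """DOM-style access: walk the child nodes and test each one."""
    for child in root:
        if child.tag == "data" and child.get("type") == data_type:
            return child
    return None


by_selector = find_by_selector(root, "birth_date")
by_traversal = find_by_traversal(root, "birth_date")
```

Both functions locate the same target tag; the selector form is shorter, while the traversal form makes each comparison explicit.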
The second obtaining module 204 obtains the key value pair according to the configuration information and the preset key specifically by:
storing the configuration information into a target object;
and forming a key value pair by the preset key and the target object.
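As a rough sketch of these two steps, the configuration information can be stored in a small object and paired with the preset key. The class name `DataConfig`, its fields, and the dictionary used as the cache are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical target object holding one type's configuration."""
    data_type: str
    regex: str
    param_type: str
    param_name: str


def build_key_value_pair(config: dict):
    """Store the configuration in a target object and pair it with the preset key."""
    target = DataConfig(
        data_type=config["type"],
        regex=config["regex"],
        param_type=config["param_type"],
        param_name=config["param_name"],
    )
    return config["preset_key"], target


# A dictionary stands in for the cache that receives the key-value pairs.
kv_cache = {}
key, target = build_key_value_pair({
    "type": "birth_date",
    "regex": r"\d{4}-\d{2}-\d{2}",
    "preset_key": "birthDate",
    "param_type": "date",
    "param_name": "birth_date",
})
kv_cache[key] = target
```

With the pair in the cache, the configuration of a given type of data can be found directly through its preset key.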
The second reading module 206 determines the data reading format according to the key-value pair and reads the target data from the file to be read according to the data reading format specifically as follows:
determining a regular expression from the key value pair;
and using the regular expression to obtain target data matched with the regular expression from the data stored in the file to be read.
When there are a plurality of regular expressions, obtaining, by using the regular expressions, target data matched with the regular expressions from the data stored in the file to be read includes:
sequentially judging whether target data matched with the regular expressions exist in all data stored in the file to be read according to a preset arrangement sequence of the regular expressions;
and if target data matched with the regular expression exists in all the data stored in the file to be read, acquiring the target data matched with the regular expression.
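The ordered matching described above might look like the following sketch. It assumes that matching stops at the first regular expression that finds something, which the text does not state explicitly; the sample text, the patterns, and the function name are likewise invented for the example.

```python
import re


def extract_first_match(text, patterns):
    """Try each regular expression in its preset order; return the first match."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None


# Hypothetical preset order: the 18-character form is tried before the 15-digit form.
ID_PATTERNS = [r"\d{17}[\dXx]", r"\d{15}"]

new_style = extract_first_match("Name: Zhang San, ID: 11010519900307123X", ID_PATTERNS)
old_style = extract_first_match("Name: Li Si, ID: 110105900307123", ID_PATTERNS)
```

The order matters in this example: if the 15-digit pattern were tried first, it would match only the leading part of an 18-character number.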
Optionally, the parsing module 203 is further configured to parse the file to be read by using a text parsing tool to obtain an input stream, after the first obtaining module 202 has obtained the extensible markup language file corresponding to the keyword from the cache according to the keyword and has obtained the file to be read.
The data acquisition apparatus further includes:
and the storage module is used for storing the input stream into a cache.
The second reading module 206 reads the target data from the file to be read according to the data reading format, including:
and reading target data in the input stream from the cache according to the data reading format.
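A minimal sketch of this optional path follows. It assumes the text has already been extracted from the Word or PDF file by some text parsing tool (none is named in the text), wraps that text in an in-memory stream, and keeps the stream in a dictionary that stands in for the cache; all names are hypothetical.

```python
import io
import re

# Hypothetical cache: file identifier -> input stream over the extracted text.
STREAM_CACHE = {}


def cache_input_stream(file_id, extracted_text):
    """Save the input stream obtained from the file to be read into the cache."""
    STREAM_CACHE[file_id] = io.StringIO(extracted_text)


def read_target(file_id, pattern):
    """Read target data from the cached input stream using a regular expression."""
    stream = STREAM_CACHE[file_id]
    stream.seek(0)  # rewind so that repeated reads see the whole content
    match = re.search(pattern, stream.read())
    return match.group(0) if match else None


cache_input_stream("resume-001", "Name: Zhang San\nDate of birth: 1990-03-07\n")
birth_date = read_target("resume-001", r"\d{4}-\d{2}-\d{2}")
```

Keeping the stream in the cache means the source file is parsed once, and later reads for other types of data reuse the same content.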
Optionally, the first obtaining module 202 is further configured to obtain a parameter type and a parameter name from the key value pair.
The storage module is further configured to store the target data according to the data storage format of the parameter type and the parameter name.
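One way to read this step is that the parameter type decides how the matched text is converted and the parameter name decides where it is stored. The sketch below follows that reading; the type names (`string`, `int`, `date`), the converter table, and the record layout are assumptions, not details from the disclosure.

```python
from datetime import date

# Hypothetical mapping from parameter type to a conversion function.
CONVERTERS = {
    "string": str,
    "int": int,
    "date": date.fromisoformat,
}


def store_target(record, param_type, param_name, raw_value):
    """Convert the raw target data by parameter type and store it under the parameter name."""
    converter = CONVERTERS.get(param_type, str)  # fall back to plain text
    record[param_name] = converter(raw_value)
    return record


record = {}
store_target(record, "date", "birth_date", "1990-03-07")
store_target(record, "int", "annual_income", "120000")
store_target(record, "string", "name", "Zhang San")
```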
In the data acquisition device depicted in FIG. 2, a data acquisition request carrying a keyword and a file to be read may be received. An extensible markup language (XML) file corresponding to the keyword is then obtained from a cache, and the XML file is parsed with a hypertext markup language (HTML) parser to obtain the configuration information of each type of data in the XML file. A preset key is obtained from the configuration information, a key-value pair is obtained according to the configuration information and the preset key, and the key-value pair is read into the cache. Finally, a data reading format is determined according to the key-value pair, and target data is read from the file to be read according to the data reading format. It can be seen that, in the present invention, the XML file may be stored in the cache in advance; when target data in a file to be read needs to be obtained, the cached XML file is loaded and parsed by the HTML parser to obtain the configuration information and the key-value pairs, the key-value pairs are read into the cache, and the target data in the file to be read is read according to the key-value pairs.
Fig. 3 is a schematic structural diagram of an electronic device implementing a data acquisition method according to a preferred embodiment of the invention. The electronic device 3 comprises a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.
Those skilled in the art will appreciate that the schematic diagram shown in fig. 3 is merely an example of the electronic device 3, and does not constitute a limitation of the electronic device 3, and may include more or less components than those shown, or combine some components, or different components, for example, the electronic device 3 may further include an input/output device, a network access device, and the like.
The electronic device 3 may also include, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an Internet Protocol Television (IPTV), an intelligent wearable device, and the like. The Network where the electronic device 3 is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
The at least one Processor 32 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The processor 32 may be a microprocessor or the processor 32 may be any conventional processor or the like, and the processor 32 is a control center of the electronic device 3 and connects various parts of the whole electronic device 3 by various interfaces and lines.
The memory 31 may be used to store the computer program 33 and/or the module/unit, and the processor 32 may implement various functions of the electronic device 3 by running or executing the computer program and/or the module/unit stored in the memory 31 and calling data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data) created according to the use of the electronic device 3, and the like. Further, the memory 31 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
With reference to fig. 1, the memory 31 in the electronic device 3 stores a plurality of instructions to implement a data acquisition method, and the processor 32 can execute the plurality of instructions to implement:
receiving a data acquisition request carrying a keyword;
according to the keywords, obtaining extensible markup language files corresponding to the keywords from a cache, and obtaining files to be read;
analyzing the extensible markup language file by using a hypertext markup language analyzer to obtain configuration information of each type of data in the extensible markup language file;
acquiring a preset key from the configuration information, and acquiring a key value pair according to the configuration information and the preset key;
reading the key-value pair into a cache;
and reading the key value pair from the cache, determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format.
Specifically, for the way the processor 32 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
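The six instructions listed above can be strung together in one compact sketch. It is illustrative only: dictionaries stand in for the caches, the file to be read is passed in as already-extracted text, Python's standard-library `xml.etree.ElementTree` replaces the HTML parser, and the rule file layout and all names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical caches: keyword -> XML rule file, keyword -> key-value pairs.
XML_CACHE = {
    "client_info": r"""<rules>
  <data type="birth_date" key="birthDate"><regex>\d{4}-\d{2}-\d{2}</regex></data>
  <data type="id_number" key="idNumber"><regex>\d{17}[\dXx]</regex></data>
</rules>""",
}
KV_CACHE = {}


def acquire(keyword, file_text):
    """Run the flow for one request: look up rules, build pairs, read target data."""
    xml_text = XML_CACHE[keyword]                      # XML file from the cache
    root = ET.fromstring(xml_text)                     # parse the rule file
    pairs = {}
    for node in root.iter("data"):
        config = {"type": node.get("type"), "regex": node.findtext("regex")}
        pairs[node.get("key")] = config                # preset key -> configuration
    KV_CACHE[keyword] = pairs                          # read the pairs into the cache

    results = {}
    for key, config in KV_CACHE[keyword].items():      # read the pairs back
        match = re.search(config["regex"], file_text)  # regex as the reading format
        if match:
            results[key] = match.group(0)
    return results


result = acquire(
    "client_info",
    "Name: Zhang San\nDate of birth: 1990-03-07\nID: 11010519900307123X",
)
```

Keying the key-value cache by keyword keeps the rules of different file types separate, which is one possible design and not something the text prescribes.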
In the electronic device 3 depicted in FIG. 3, a data acquisition request carrying a keyword and a file to be read may be received. An extensible markup language (XML) file corresponding to the keyword is then obtained from a cache, and the XML file is parsed with a hypertext markup language (HTML) parser to obtain the configuration information of each type of data in the XML file. A preset key is obtained from the configuration information, a key-value pair is obtained according to the configuration information and the preset key, and the key-value pair is read into the cache. Finally, a data reading format is determined according to the key-value pair, and target data is read from the file to be read according to the data reading format. It can be seen that, in the present invention, the XML file may be stored in the cache in advance; when target data in a file to be read needs to be obtained, the cached XML file is loaded and parsed by the HTML parser to obtain the configuration information and the key-value pairs, the key-value pairs are read into the cache, and the target data in the file to be read is read according to the key-value pairs.
The integrated modules/units of the electronic device 3 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, and Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for data acquisition, the method comprising:
receiving a data acquisition request carrying a keyword;
according to the keywords, obtaining extensible markup language files corresponding to the keywords from a cache, and obtaining files to be read;
analyzing the extensible markup language file by using a hypertext markup language analyzer to obtain configuration information of each type of data in the extensible markup language file;
acquiring a preset key from the configuration information, and acquiring a key value pair according to the configuration information and the preset key;
reading the key-value pair into a cache;
and reading the key value pair from the cache, determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format.
2. The method of claim 1, wherein parsing the xml file using a html parser to obtain configuration information for each type of data in the xml file comprises:
for each type of data, determining a target label related to the data from the extensible markup language file according to the type of the data;
reading the target tag through a selector of the hypertext markup language parser to obtain configuration information of the data; or reading the target tag through a document object model access method of the hypertext markup language parser to obtain the configuration information of the data.
3. The method of claim 1, wherein obtaining the key-value pair according to the configuration information and the preset key comprises:
storing the configuration information into a target object;
and forming a key value pair by the preset key and the target object.
4. The method according to claim 1, wherein the determining a data reading format according to the key-value pair, and reading target data from the file to be read according to the data reading format comprises:
determining a regular expression from the key value pair;
and using the regular expression to obtain target data matched with the regular expression from the data stored in the file to be read.
5. The method according to claim 4, wherein there are a plurality of regular expressions, and the obtaining, by using the regular expressions, target data matched with the regular expressions from the data stored in the file to be read includes:
sequentially judging whether target data matched with the regular expressions exist in all data stored in the file to be read according to a preset arrangement sequence of the regular expressions;
and if target data matched with the regular expression exists in all the data stored in the file to be read, acquiring the target data matched with the regular expression.
6. The method according to claim 1, wherein after obtaining the xml file corresponding to the keyword from the cache according to the keyword and obtaining the file to be read, the method further comprises:
analyzing the file to be read by using a text analysis tool to obtain an input stream;
saving the input stream to a cache;
the reading of the target data from the file to be read according to the data reading format comprises:
and reading target data in the input stream from the cache according to the data reading format.
7. The method according to any one of claims 1 to 6, further comprising:
acquiring a parameter type and a parameter name from the key value pair;
and storing the target data according to the data storage format of the parameter type and the parameter name.
8. A data acquisition apparatus, characterized in that the data acquisition apparatus comprises:
the receiving module is used for receiving a data acquisition request carrying a keyword;
the first acquisition module is used for acquiring an extensible markup language file corresponding to the keyword from a cache according to the keyword and acquiring a file to be read;
the analysis module is used for analyzing the extensible markup language file by using a hypertext markup language analyzer to obtain the configuration information of each type of data in the extensible markup language file;
the second acquisition module is used for acquiring a preset key from the configuration information and acquiring a key value pair according to the configuration information and the preset key;
the first reading module is used for reading the key-value pair into a cache;
and the second reading module is used for reading the key value pair from the cache, determining a data reading format according to the key value pair, and reading target data from the file to be read according to the data reading format.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the data acquisition method of any one of claims 1 to 7.
10. A computer-readable storage medium storing at least one instruction which, when executed by a processor, implements a data acquisition method as recited in any one of claims 1 to 7.
CN201910881318.8A 2019-09-18 2019-09-18 Data acquisition method and device, electronic equipment and storage medium Pending CN110851136A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910881318.8A CN110851136A (en) 2019-09-18 2019-09-18 Data acquisition method and device, electronic equipment and storage medium
PCT/CN2019/118979 WO2021051624A1 (en) 2019-09-18 2019-11-15 Data acquisition method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910881318.8A CN110851136A (en) 2019-09-18 2019-09-18 Data acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110851136A true CN110851136A (en) 2020-02-28

Family

ID=69594835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881318.8A Pending CN110851136A (en) 2019-09-18 2019-09-18 Data acquisition method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110851136A (en)
WO (1) WO2021051624A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035408A (en) * 2020-09-01 2020-12-04 文思海辉智科科技有限公司 Text processing method and device, electronic equipment and storage medium
CN113449502A (en) * 2021-06-29 2021-09-28 平安资产管理有限责任公司 Document generation method and system based on dynamic data
CN113553297A (en) * 2021-06-08 2021-10-26 优刻得科技股份有限公司 Management method and system for switch configuration information
CN117687626A (en) * 2024-02-04 2024-03-12 双一力(宁波)电池有限公司 Host computer and main program matching system and method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556536A (en) * 2008-04-11 2009-10-14 北京闻言科技有限公司 Method for configuring application program by using self-defining configuration files
CN105354311A (en) * 2015-11-10 2016-02-24 科大智能电气技术有限公司 Data key value pair storage method based on embedded equipment file system
CN106649451A (en) * 2016-09-22 2017-05-10 北京奇虎科技有限公司 Data update method and device
CN107145538A (en) * 2017-04-21 2017-09-08 网易(杭州)网络有限公司 List data querying method, device and system
CN107169047A (en) * 2017-04-25 2017-09-15 腾讯科技(深圳)有限公司 A kind of method and device for realizing data buffer storage
CN107562936A (en) * 2017-09-12 2018-01-09 中山大学 A kind of crawl of web page news list based on Jsoup and store method
CN109450969A (en) * 2018-09-27 2019-03-08 北京奇艺世纪科技有限公司 The method, apparatus and server of data are obtained from third party's data source server
CN109725932A (en) * 2017-10-31 2019-05-07 北京京东尚科信息技术有限公司 A kind of application component illustrates document generation method and device
CN109947720A (en) * 2019-04-12 2019-06-28 苏州浪潮智能科技有限公司 A kind of pre-reading method of files, device, equipment and readable storage medium storing program for executing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102594833B (en) * 2012-03-09 2016-01-06 北京思特奇信息技术股份有限公司 A kind of communication protocol adapting method and system
CN108885627B (en) * 2016-01-11 2022-04-05 甲骨文美国公司 Query-as-a-service system providing query result data to remote client
CN108228597A (en) * 2016-12-14 2018-06-29 深圳市优朋普乐传媒发展有限公司 Data bank access method and device
CN107908485B (en) * 2017-10-26 2020-08-04 中国平安人寿保险股份有限公司 Interface parameter transmission method, device, equipment and computer readable storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035408A (en) * 2020-09-01 2020-12-04 文思海辉智科科技有限公司 Text processing method and device, electronic equipment and storage medium
CN112035408B (en) * 2020-09-01 2023-10-31 文思海辉智科科技有限公司 Text processing method, device, electronic equipment and storage medium
CN113553297A (en) * 2021-06-08 2021-10-26 优刻得科技股份有限公司 Management method and system for switch configuration information
CN113553297B (en) * 2021-06-08 2023-01-06 优刻得科技股份有限公司 Management method and system for switch configuration information
CN113449502A (en) * 2021-06-29 2021-09-28 平安资产管理有限责任公司 Document generation method and system based on dynamic data
CN117687626A (en) * 2024-02-04 2024-03-12 双一力(宁波)电池有限公司 Host computer and main program matching system and method
CN117687626B (en) * 2024-02-04 2024-05-03 双一力(宁波)电池有限公司 Host computer and main program matching system and method

Also Published As

Publication number Publication date
WO2021051624A1 (en) 2021-03-25

Similar Documents

Publication Publication Date Title
CN109325009B (en) Log analysis method and device
CN108920659B (en) Data processing system, data processing method thereof, and computer-readable storage medium
CN109522018B (en) Page processing method and device and storage medium
CN110851136A (en) Data acquisition method and device, electronic equipment and storage medium
JP6922538B2 (en) API learning
CN100440222C (en) System and method for text legibility enhancement
WO2007144853A2 (en) Method and apparatus for performing customized paring on a xml document based on application
CN110795697B (en) Method and device for acquiring logic expression, storage medium and electronic device
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
US20220121668A1 (en) Method for recommending document, electronic device and storage medium
CN111241496B (en) Method and device for determining small program feature vector and electronic equipment
CN108885544B (en) Front-end page internationalized processing method, application server and computer-readable storage medium
CN116644213A (en) XML file reading method, device, equipment and storage medium
CN104778232A (en) Searching result optimizing method and device based on long query
CN113127776A (en) Breadcrumb path generation method and device and terminal equipment
CN112380337A (en) Highlight method and device based on rich text
CN111639250A (en) Enterprise description information acquisition method and device, electronic equipment and storage medium
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
CN115759029A (en) Document template processing method and device, electronic equipment and storage medium
JP2024507029A (en) Web page identification methods, devices, electronic devices, media and computer programs
US20150324333A1 (en) Systems and methods for automatically generating hyperlinks
CN112016017A (en) Method and device for determining characteristic data
CN113779438B (en) Webpage text information processing method and device and terminal equipment
CN109522211A (en) Interface parameters transmission method, device, electronic equipment and storage medium
CN111310465B (en) Parallel corpus acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination