CN115098642A - Data processing method and device, computer equipment and storage medium - Google Patents

Data processing method and device, computer equipment and storage medium

Info

Publication number
CN115098642A
Authority
CN
China
Prior art keywords
data
original document
structured data
elements
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210817671.1A
Other languages
Chinese (zh)
Inventor
屠乐奇
郭磊
戎辉
杨真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Asset Management Co Ltd
Original Assignee
Ping An Asset Management Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a data processing method, a data processing device, computer equipment and a storage medium, and belongs to the field of computers. The method preprocesses an acquired original document to obtain data to be processed, identifies the data to be processed with a semantic model to obtain data elements and the associated information corresponding to each data element, and then generates associated structured data corresponding to the original document from the associated information and the data elements. Compared with the original document, the associated structured data is easier to review and check, which improves both the efficiency of data review and the accuracy of the data.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a data processing method, apparatus, computer device, and storage medium.
Background
Preparing, disclosing, searching and using financial documents has long been labor intensive, especially in financial scenarios such as bond issuance, where large numbers of documents are processed and the same financial documents often have to be prepared multiple times. In particular, financial reports in unstructured form often need to be reviewed manually, and errors easily arise in such heavy work, which reduces the accuracy of the reports.
Disclosure of Invention
To address the problem that existing unstructured financial documents are difficult to review, a data processing method, a device, a computer device and a storage medium that facilitate review are provided.
In order to achieve the above object, the present invention provides a data processing method, including:
acquiring an original document;
preprocessing the original document to obtain data to be processed;
and identifying data to be processed by adopting a semantic model to obtain data elements and associated information corresponding to the data elements, and generating associated structured data corresponding to the original document according to the associated information and the data elements.
Optionally, the preprocessing the original document to obtain data to be processed includes:
and identifying the original document by adopting an optical character recognition (OCR) method to obtain the data to be processed.
Optionally, the identifying, by using a semantic model, data to be processed to obtain a data element and associated information corresponding to the data element, and generating associated structured data corresponding to the original document according to the associated information and the data element includes:
performing word segmentation recognition on the data to be processed according to words in a financial dictionary by adopting a semantic model to obtain the data elements and the associated scores of the data elements;
determining the association information of the data elements in the original document according to the positions, the types and the association scores of the data elements in the original document;
generating associated structured data corresponding to the original document based on the data elements and the associated information corresponding to the data elements, wherein the associated structured data comprises at least one piece of structured data, each piece of structured data corresponds to one data element, and each piece of structured data comprises the data element together with its paragraph position, paragraph type, page number and association score in the original document.
Optionally, the method further includes:
receiving a search request, wherein the search request comprises a target element and a search object;
according to a search object in the search request, determining an original document corresponding to the search object;
searching the associated structured data corresponding to the original document according to a target element in the search request, and acquiring target structured data corresponding to the target element;
and determining corresponding content in the original document according to the target structured data.
Optionally, the search object corresponds to one original document, or the search object corresponds to a plurality of original documents.
Optionally, the searching, according to the target element in the search request, the associated structured data corresponding to the original document to obtain the target structured data corresponding to the target element includes:
searching all structured data in the associated structured data corresponding to the original document according to a target element in the search request, and acquiring candidate structured data matched with the target element;
and determining the target structured data matched with the target elements according to the obtained association scores of the data elements in the candidate structured data.
Optionally, the original document and the associated structured data corresponding to the original document are verified by using a regular expression.
To achieve the above object, the present invention provides a data processing apparatus comprising:
an acquisition unit configured to acquire an original document;
the processing unit is used for preprocessing the original document to obtain data to be processed;
and the association unit is used for identifying data to be processed by adopting a semantic model to obtain a data element and association information corresponding to the data element, and generating association structured data corresponding to the original document according to the association information and the data element.
To achieve the above object, the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
To achieve the above object, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the above method.
The data processing method, the data processing device, the computer equipment and the storage medium provided by the invention can be used for preprocessing the acquired original document to obtain the data to be processed, identifying the data to be processed by adopting the semantic model to obtain the data elements and the associated information corresponding to the data elements, and further generating the associated structured data corresponding to the original document according to the associated information and the data elements, so that the associated structured data is convenient to review and correct compared with the original document, and the data review efficiency and the data accuracy are greatly improved.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of a data processing method according to the present invention;
FIG. 2 is a flow chart of a method of another embodiment of a data processing method according to the present invention;
FIG. 3 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 4 is a schematic diagram of a hardware architecture of an embodiment of a computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The data processing method, the data processing device, the computer equipment and the storage medium are suitable for the financial field. According to the method and the device, the acquired original document can be preprocessed to obtain the data to be processed, the data element and the associated information corresponding to the data element are obtained by identifying the data to be processed by adopting the semantic model, and the associated structured data corresponding to the original document is further generated according to the associated information and the data element, so that the associated structured data is convenient to review and correct relative to the original document, and the data review efficiency and the data accuracy are greatly improved.
Embodiment one
Referring to fig. 1, a data processing method of the present embodiment includes the following steps:
S1, obtaining an original document.
In this embodiment, the original document may be an original financial document or a prospectus, and the original document may be in a PDF format or an image document format. The data processing method of this embodiment can be applied to the normalization of financial documents: unstructured financial documents are converted into associated structured data that is structured and convenient to review.
S2, preprocessing the original document to obtain data to be processed.
Further, in step S2, the original document may be recognized by an Optical Character Recognition (OCR) method to obtain the to-be-processed data.
In this embodiment, the original document is scanned by the OCR method, and the resulting image document is analyzed to obtain text information and layout information.
S3, identifying the data to be processed by adopting a semantic model to obtain data elements and associated information corresponding to the data elements, and generating associated structured data corresponding to the original document according to the associated information and the data elements.
It should be noted that the associated structured data comprises at least one piece of structured data, each piece of structured data corresponds to one data element, and each piece of structured data comprises the data element together with its paragraph position in the original document, paragraph type, page number and association score.
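The record layout described above can be sketched as a small data class. This is only an illustration: the class name `StructuredRecord`, the field names and the sample values are assumptions made here, not identifiers from the application.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StructuredRecord:
    """One piece of structured data, describing a single data element (hypothetical layout)."""
    element: str              # the data element itself, e.g. a term or a table cell
    paragraph_position: int   # index of the paragraph that holds the element
    paragraph_type: str       # e.g. "title", "paragraph" or "table"
    page_number: int          # page of the original document
    association_score: float  # importance of the element in the document


# Associated structured data: one record per data element (sample values are made up).
associated_structured_data = [
    StructuredRecord("net profit", 12, "paragraph", 3, 0.82),
    StructuredRecord("registered capital", 15, "table", 4, 0.64),
]

first_record = asdict(associated_structured_data[0])
print(first_record)
```

Keeping the position fields next to the element is what later allows a search hit to be traced back to a location in the original document.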
In the embodiment, the data element and the associated information corresponding to the data element are obtained by identifying the data to be processed by adopting the semantic model, and the associated structured data corresponding to the original document is further generated according to the associated information and the data element.
Further, step S3 may include:
S31, performing word segmentation recognition on the data to be processed by adopting a semantic model according to words in a financial dictionary to obtain the data elements and the association scores of the data elements.
The financial dictionary referred to here is a modern reference book compiled by the Tianjin Institute of Finance and Economics. It contains 1,800 entries on finance in 9 categories, including basic theory of finance, banking business, insurance and trust investment, international finance, Chinese currency, Chinese financial history, foreign currency, and foreign banks and financial groups; the "other" category collects entries that are closely related to the financial profession. Entries are ordered by the stroke count of their first Chinese character, and a classified catalogue is attached. The book ends with 12 appendices, such as the names of the currencies of various countries and the 100 largest commercial banks in the world.
In this embodiment, the semantic model may use a Hidden Markov Model (HMM). In a hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens. Thus, the sequence of tokens generated by an HMM provides information about the underlying sequence of states.
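How an HMM recovers hidden states from visible tokens can be illustrated with Viterbi decoding. The sketch below is a generic textbook decoder, not the application's model: the two states `GEN` (general word) and `FIN` (financial term) and every probability are toy assumptions chosen only to make the example small.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observed tokens."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model (all numbers are made up for illustration).
states = ["GEN", "FIN"]
start_p = {"GEN": 0.6, "FIN": 0.4}
trans_p = {"GEN": {"GEN": 0.7, "FIN": 0.3}, "FIN": {"GEN": 0.4, "FIN": 0.6}}
emit_p = {
    "GEN": {"the": 0.5, "net": 0.4, "profit": 0.1},
    "FIN": {"the": 0.1, "net": 0.3, "profit": 0.6},
}

tags = viterbi(["the", "net", "profit"], states, start_p, trans_p, emit_p)
print(tags)  # ['GEN', 'GEN', 'FIN']
```

A production segmenter would normally work in log space to avoid underflow on long inputs; plain products are kept here for readability.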
In this embodiment, the semantic model may also adopt a word segmentation algorithm (e.g., forward maximum matching, reverse maximum matching, bidirectional maximum matching, semantic-understanding word segmentation, or word-frequency-statistics word segmentation) and/or a vector space model. The vector space model first generates, from a training sample set, the sequence of feature items required for text representation. Then, according to that feature item sequence, each document in the training sample set and the test sample set is processed by weight assignment, normalization and the like, and converted into the feature vector required by the learning algorithm. For a document containing feature items (data elements), each feature item is given a weight according to a preset rule to express its importance in the document, which yields the association score of the data element. The association score is thus a score of the importance of a data element in the original document.
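Of the segmentation algorithms listed, forward maximum matching is the simplest to show. The following is a minimal sketch under stated assumptions: the function name, the three-entry dictionary and the sample sentence are illustrative only, and a real financial dictionary would be far larger.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary: "net profit", "registered capital", "statutory surplus reserve".
financial_dictionary = {"净利润", "注册资本", "法定盈余公积"}

tokens = forward_max_match("净利润增长", financial_dictionary)  # "net profit grows"
print(tokens)  # ['净利润', '增', '长']
```

Reverse maximum matching scans from the end of the text instead, and the bidirectional variant runs both and keeps the segmentation with fewer tokens.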
S32, determining the associated information of the data elements in the original document according to the positions, the types and the association scores of the data elements in the original document.
In the present embodiment, the association information of the obtained data elements with the original document is determined according to their corresponding position information (position and category) in the original document.
Specifically, the position information of a data element in the original document may include the position of the paragraph where the data element is located, the type of that paragraph (title, paragraph), and the page number where the data element is located. The associated information includes the position of the paragraph where the data element is located, the type of that paragraph (title, paragraph), the page number where the data element is located, the content corresponding to the data element, and the association score of the data element.
S33, generating associated structured data corresponding to the original document based on the data elements and the associated information corresponding to the data elements.
In this embodiment, each data element corresponds to one piece of structured data, which is the associated information corresponding to that data element. The associated information of the data elements is arranged in the order in which the data elements appear in the original document, thereby obtaining the associated structured data.
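The ordering step can be sketched as a sort on page number and paragraph position. The `Element` tuple, its field names and the sample values below are assumptions for illustration; the application does not prescribe a concrete data layout.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """A data element with its associated information (hypothetical field names)."""
    text: str
    page_number: int
    paragraph_position: int
    paragraph_type: str
    association_score: float


def build_associated_structured_data(elements):
    """Arrange per-element records in the order the elements appear in the document."""
    ordered = sorted(elements, key=lambda e: (e.page_number, e.paragraph_position))
    return [e._asdict() for e in ordered]


# Elements as they might come out of recognition, in arbitrary order (made-up values).
recognized = [
    Element("registered capital", 4, 15, "table", 0.64),
    Element("Annual Report", 1, 0, "title", 0.90),
    Element("net profit", 3, 12, "paragraph", 0.82),
]

structured = build_associated_structured_data(recognized)
print([record["text"] for record in structured])
```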
In this embodiment, in order to establish the relationship between the associated structured data and the original document, the associated structured data holds a digital abstract of the original document. The whole original document is divided into a plurality of data elements, each data element holds its coordinate position and page number in the text, and each data element has a specific type, such as paragraph, table or title. In the structuring process, financial expert knowledge is added: the word segmenter of the semantic model performs word segmentation, financial semantics are added, and the various financial indicators and terms are identified, so that the relevance of each data element of the document can be scored.
In a preferred embodiment, referring to the data processing method shown in fig. 2, the method further includes:
S4, receiving a search request, wherein the search request comprises a target element and a search object.
It should be noted that: the search object corresponds to one original document, or the search object corresponds to a plurality of original documents.
In this embodiment, a specific original document can be searched according to the search request, and a plurality of original documents can be searched according to the search request, so as to achieve the purpose of cross-document intelligent search.
S5, determining an original document corresponding to the search object according to the search object in the search request.
S6, searching the associated structured data corresponding to the original document according to the target elements in the search request, and acquiring the target structured data corresponding to the target elements.
Further, step S6 may include:
S61, searching all the structured data in the associated structured data corresponding to the original document according to the target elements in the search request, and acquiring candidate structured data matched with the target elements.
S62, determining target structured data matched with the target elements according to the obtained association scores of the data elements in the candidate structured data.
In this embodiment, data whose association score meets a preset condition is screened out from the candidate structured data as the target structured data. The preset condition may be that the association score is higher than a certain threshold, or that the entry is among the few with the highest association scores (e.g., the top three). By way of example and not limitation, the content that appears most often in an annual financial statement is financial tables of various kinds, and a financial table is divided into a plurality of data elements by cell. As shown in the table below, each cell is an independent data element:
            Index 1    Index 2
Subject 1   Data 1     Data 2
Subject 2   Data 3     Data 4
When a user searches for Data 4, the two data elements most closely related to Data 4 in business terms, namely its subject and its index (Subject 2 and Index 2), are assigned higher association scores, and during the user's intelligent search the subject and index data elements are returned preferentially according to relevance.
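The two preset conditions (a score threshold or the top few entries) can be sketched as one filter. The function name `select_targets` and the association scores assigned to the table cells are assumptions made for this illustration.

```python
def select_targets(candidates, threshold=None, top_k=None):
    """Pick target records from candidates by association score."""
    ranked = sorted(candidates, key=lambda c: c["association_score"], reverse=True)
    if threshold is not None:
        # Condition 1: keep only scores above the threshold.
        ranked = [c for c in ranked if c["association_score"] > threshold]
    if top_k is not None:
        # Condition 2: keep only the top_k highest-scoring records.
        ranked = ranked[:top_k]
    return ranked


# Candidates for a search on "Data 4" (scores are made up): its subject and index rank highest.
candidates = [
    {"element": "Subject 1", "association_score": 0.20},
    {"element": "Subject 2", "association_score": 0.90},
    {"element": "Data 3", "association_score": 0.30},
    {"element": "Index 2", "association_score": 0.85},
]

by_threshold = [c["element"] for c in select_targets(candidates, threshold=0.5)]
top_three = [c["element"] for c in select_targets(candidates, top_k=3)]
print(by_threshold, top_three)
```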
By way of example and not limitation, the "statutory surplus reserve" is a financial indicator: an enterprise must set aside from its profits 10% of the current net profit to make up for losses, and no further amount is set aside once the statutory surplus reserve reaches 50% of the registered capital. Through expert knowledge injection, the statutory surplus reserve is therefore directly linked to the enterprise's net profit and net loss, so that it ranks higher than other terms in queries about net profit and loss, which makes associated searches easier for the user. When the statutory surplus reserve is searched, the original document is segmented by the word segmenter with expert knowledge added, vectors are calculated by the Skip-gram method, and the document or multiple documents are intelligently searched according to vector similarity, so that, starting from one financial report, the financial indicators with the highest similarity and relevance in other financial documents can be found.
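Searching by vector similarity can be sketched with cosine similarity over term vectors. The three-dimensional vectors below are invented stand-ins for Skip-gram embeddings, which in practice would be trained on a corpus and have far more dimensions; the function names are also assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, vectors):
    """Return the term whose vector is closest to the query term's vector."""
    q = vectors[query]
    others = ((term, cosine_similarity(q, v)) for term, v in vectors.items() if term != query)
    return max(others, key=lambda pair: pair[1])[0]


# Made-up vectors standing in for trained Skip-gram embeddings.
term_vectors = {
    "net profit": [0.8, 0.9, 0.2],
    "surplus reserve": [0.9, 0.8, 0.1],
    "office address": [0.1, 0.0, 0.9],
}

closest = most_similar("net profit", term_vectors)
print(closest)  # surplus reserve
```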
S7, determining corresponding contents in the original document according to the target structured data.
In the embodiment, according to a search object in a search request, an original document corresponding to the search object is determined, and then associated structured data corresponding to the original document is obtained, a target element is searched in the associated structured data to obtain a data element corresponding to the target element, the structured data corresponding to the data element is used as target structured data, and corresponding data is positioned in the original document according to the target structured data and is output.
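Steps S4 to S7 can be sketched end to end as follows. The request keys, the `corpus` mapping from document name to records, and the substring match used for S61 are all assumptions chosen for this illustration; the application leaves the matching rule open.

```python
def search(request, corpus, top_k=3):
    """Search one or several documents' structured data for a target element."""
    search_object = request["search_object"]
    # S5: the search object may name a single document or several documents.
    doc_names = [search_object] if isinstance(search_object, str) else list(search_object)
    hits = []
    for name in doc_names:
        # S61: collect candidate records whose element matches the target element.
        for record in corpus.get(name, []):
            if request["target_element"] in record["element"]:
                hits.append({"document": name, **record})
    # S62: rank the candidates by association score.
    hits.sort(key=lambda h: h["association_score"], reverse=True)
    # S7: each hit carries the page and paragraph needed to locate the original content.
    return hits[:top_k]


# Made-up structured data for two documents.
corpus = {
    "annual_report_2021": [
        {"element": "net profit", "page_number": 3, "paragraph_position": 12, "association_score": 0.82},
        {"element": "registered capital", "page_number": 4, "paragraph_position": 15, "association_score": 0.64},
    ],
    "annual_report_2022": [
        {"element": "net profit margin", "page_number": 5, "paragraph_position": 20, "association_score": 0.71},
    ],
}

request = {"target_element": "net profit", "search_object": ["annual_report_2021", "annual_report_2022"]}
results = search(request, corpus)
print([(r["document"], r["page_number"]) for r in results])
```

Passing a single document name as `search_object` restricts the search to that document, while a list gives the cross-document search described above.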
In a preferred embodiment, after the step S3, the data processing method may further include:
A. verifying the original document and the associated structured data corresponding to the original document by adopting a regular expression.
In this embodiment, the regular expressions may include numeric check expressions (e.g., positive numbers, negative numbers, decimals, non-negative integers, positive integers, floating point numbers), character check expressions (e.g., strings composed of Chinese characters, English letters, digits, or the 26 lowercase letters, and strings composed of the 26 uppercase letters), and special-requirement expressions (e.g., email address, domain name, mobile phone number, identity card number, month format, whether an account is legal, date format).
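A few of the check expressions listed above can be sketched with Python's `re` module. The pattern names and the exact patterns are assumptions for illustration; the application does not give concrete expressions, and real validation rules (for example for dates) are usually stricter.

```python
import re

# Hypothetical check expressions, one per kind of value.
PATTERNS = {
    "non_negative_integer": re.compile(r"[0-9]+"),
    "decimal": re.compile(r"-?[0-9]+\.[0-9]+"),
    "lowercase_letters": re.compile(r"[a-z]+"),
    "uppercase_letters": re.compile(r"[A-Z]+"),
    # Checks the YYYY-MM-DD shape only; it does not reject dates such as 2022-02-31.
    "date_yyyy_mm_dd": re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"),
}


def validate(kind, value):
    """Return True if the whole value matches the check expression for its kind."""
    return PATTERNS[kind].fullmatch(value) is not None


checks = {
    "decimal ok": validate("decimal", "-12.50"),
    "negative rejected": validate("non_negative_integer", "-3"),
    "date ok": validate("date_yyyy_mm_dd", "2022-07-12"),
    "bad month rejected": validate("date_yyyy_mm_dd", "2022-13-01"),
}
print(checks)
```

Running the same check on a value in the structured data and on the corresponding text of the original document is one simple way to flag extraction errors.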
To structure and standardize financial statements, XBRL (eXtensible Business Reporting Language) technology emerged. XBRL is an Internet-based, cross-platform computer language dedicated to the preparation, disclosure and use of financial reports, and it largely achieves the integration and maximum utilization of data. However, for historical financial documents recompiled with XBRL, it is difficult to proofread and verify them against the original unstructured data, so the accuracy of the recompiled financial documents cannot be ensured.
The data processing method of this embodiment makes it convenient to structure the original document (a historical financial document) and easy to distribute and transcribe documents. It lets a user quickly navigate to and locate linked positions in the original document, proofread differences between the structured form and the original document, and easily generate indexes such as a table catalogue, which greatly reduces manual work and alleviates the problem that XBRL financial documents on the market are disconnected from the original financial documents and cannot be verified against them. In addition, adding expert knowledge and NLP (Natural Language Processing) helps professional practitioners search quickly and locate relevant documents intelligently, thereby improving an enterprise's service quality and customer satisfaction.
In this embodiment, the data processing method can preprocess the acquired original document to obtain data to be processed, identify the data to be processed by using the semantic model to obtain the data elements and the associated information corresponding to the data elements, and further generate the associated structured data corresponding to the original document according to the associated information and the data elements, so that the associated structured data is convenient to review and collate compared with the original document, and the efficiency of data review and the accuracy of data are greatly improved.
In this embodiment, the data processing method can recompile a large number of unstructured original documents (such as financial documents) into associated structured data, and the associated structured data can be proofread against the unstructured original documents, which reduces the cost of manual proofreading and improves the reliability and accuracy of the recompiled documents. The associated structured data is also convenient to search and supports cross-document intelligent search, which improves usability and the user experience.
Embodiment two
Referring to fig. 3, a data processing apparatus 1 of the present embodiment may include: an acquisition unit 11, a processing unit 12 and an association unit 13.
An acquisition unit 11 for acquiring an original document.
In this embodiment, the original document may be an original financial document or a prospectus, and the original document may be in a PDF format or an image document format. The data processing device 1 of this embodiment can be applied to the standardization of financial documents: unstructured financial documents are converted into associated structured data that is structured and convenient to review.
The processing unit 12 is configured to preprocess the original document to obtain data to be processed.
Further, the processing unit 12 may use an optical character recognition method to recognize the original document to obtain the data to be processed.
In this embodiment, the original document is scanned by the OCR method, and the resulting image document is analyzed to obtain text information and layout information.
The association unit 13 is configured to identify data to be processed by using a semantic model to obtain a data element and association information corresponding to the data element, and generate associated structured data corresponding to the original document according to the association information and the data element.
It should be noted that the associated structured data comprises at least one piece of structured data, each piece of structured data corresponds to one data element, and each piece of structured data comprises the data element together with its paragraph position in the original document, paragraph type, page number and association score.
In the embodiment, the data element and the associated information corresponding to the data element are obtained by identifying the data to be processed by adopting the semantic model, and the associated structured data corresponding to the original document is further generated according to the associated information and the data element.
Further, the associating unit 13 may include: the device comprises an identification module, a positioning module and a generation module.
The recognition module is configured to perform word segmentation recognition on the data to be processed according to the words in the financial dictionary by adopting a semantic model to obtain the data elements and the association scores of the data elements.
The financial dictionary referred to here is a modern reference book compiled by the Tianjin Institute of Finance and Economics. It contains 1,800 entries on finance in 9 categories, including basic theory of finance, banking business, insurance and trust investment, international finance, Chinese currency, Chinese financial history, foreign currency, and foreign banks and financial groups; the "other" category collects entries that are closely related to the financial profession. Entries are ordered by the stroke count of their first Chinese character, and a classified catalogue is attached. The book ends with 12 appendices, such as the names of the currencies of various countries and the 100 largest commercial banks in the world.
In this embodiment, the semantic model may use a Hidden Markov Model (HMM). In a hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens. Thus, the sequence of tokens generated by an HMM provides information about the underlying sequence of states.
In this embodiment, the semantic model may also adopt a word segmentation algorithm (e.g., forward maximum matching, reverse maximum matching, bidirectional maximum matching, semantic-understanding word segmentation, or word-frequency-statistics word segmentation) and/or a vector space model. The vector space model first generates, from a training sample set, the sequence of feature items required for text representation. Then, according to that feature item sequence, each document in the training sample set and the test sample set is processed by weight assignment, normalization and the like, and converted into the feature vector required by the learning algorithm. For a document containing feature items (data elements), each feature item is given a weight according to a preset rule to express its importance in the document, which yields the association score of the data element. The association score is thus a score of the importance of a data element in the original document.
The positioning module is configured to determine the associated information of the data elements in the original document according to the positions, types and association scores of the data elements in the original document.
In the present embodiment, the associated information linking each obtained data element to the original document is determined from the element's position information (position and category) in the original document.
Specifically, the position information of a data element in the original document may include the position of the paragraph containing the data element, the type of that paragraph (title or paragraph), and the page number on which the data element appears. The associated information includes the paragraph position, the paragraph type (title or paragraph), the page number, the content corresponding to the data element, and the association score of the data element.
The generating module is configured to generate the associated structured data corresponding to the original document based on the data elements and their associated information.
In this embodiment, each data element corresponds to one piece of structured data, namely the associated information of that data element. The pieces of associated information are arranged in the order in which the data elements appear in the original document, which yields the associated structured data.
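One possible shape for a piece of structured data, and for the ordering step, is sketched below in Python. The class name, field names and sample records are hypothetical; only the list of fields (paragraph position, paragraph type, page number, content, association score) comes from the description.

```python
from dataclasses import asdict, dataclass


@dataclass
class StructuredRecord:
    """One piece of structured data for one data element (hypothetical layout)."""
    element: str             # the data element itself
    page: int                # page number in the original document
    paragraph_position: int  # index of the paragraph on that page
    paragraph_type: str      # e.g. "title" or "paragraph"
    content: str             # content corresponding to the data element
    score: float             # association score


def build_associated_structured_data(records):
    """Order the records as the elements appear in the original document."""
    ordered = sorted(records, key=lambda r: (r.page, r.paragraph_position))
    return [asdict(r) for r in ordered]


records = [
    StructuredRecord("net profit", 3, 2, "paragraph", "Net profit rose.", 0.9),
    StructuredRecord("annual report", 1, 0, "title", "Annual Report", 0.5),
    StructuredRecord("registered capital", 3, 1, "paragraph", "Capital.", 0.4),
]
structured = build_associated_structured_data(records)
print([item["element"] for item in structured])
# ['annual report', 'registered capital', 'net profit']
```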
In this embodiment, to link the associated structured data to the original document, the associated structured data holds a digital digest of the original document. The whole original document is divided into a plurality of data elements; each data element holds its coordinate position and page number in the text and has a specific type, such as paragraph, table or title. During structuring, financial expert knowledge is added: the word segmenter of the semantic model segments the text, financial semantics are attached, and the various financial indexes and terms are identified, so that each data element of the document receives a relevance score.
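The link between the structured data and its source document could, for instance, be kept as a hash of the source bytes. The sketch below assumes that the digest is a SHA-256 hash, which the description does not state; the function names and the dictionary layout are likewise hypothetical.

```python
import hashlib


def document_digest(raw_bytes: bytes) -> str:
    """Assumed digest: SHA-256 over the raw bytes of the original document."""
    return hashlib.sha256(raw_bytes).hexdigest()


def attach_digest(raw_bytes: bytes, elements: list) -> dict:
    """Bundle the data elements with the digest of the document they came from."""
    return {"source_digest": document_digest(raw_bytes), "elements": list(elements)}


def matches_source(structured: dict, raw_bytes: bytes) -> bool:
    """Check that the structured data was derived from exactly this document."""
    return structured["source_digest"] == document_digest(raw_bytes)


original = b"Annual report 2021: net profit ..."
bundle = attach_digest(original, [{"element": "net profit", "page": 1}])
print(matches_source(bundle, original))         # True
print(matches_source(bundle, original + b"x"))  # False
```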
In a preferred embodiment, the data processing device 1 further comprises a receiving unit, a determining unit and a searching unit.
The receiving unit is configured to receive a search request.
The search request includes a target element and a search object.
It should be noted that the search object corresponds to one original document or to a plurality of original documents.
In this embodiment, either one specific original document or a plurality of original documents can be searched according to the search request, which enables intelligent cross-document search.
The determining unit is configured to determine the original document corresponding to the search object in the search request.
The searching unit is configured to search the associated structured data corresponding to the original document according to the target element in the search request and to acquire the target structured data corresponding to the target element.
Further, the searching unit may include a searching module, a first determining module and a second determining module.
The searching module is configured to search all structured data in the associated structured data corresponding to the original document according to the target element in the search request and to acquire the candidate structured data matching the target element.
The first determining module is configured to determine the target structured data matching the target element according to the association scores of the data elements in the candidate structured data.
In this embodiment, the data whose association score meets a preset condition is selected from the candidate structured data as the target structured data. The preset condition may be that the association score exceeds a certain threshold, or that the entry is among the few (e.g., the three) with the highest association scores. By way of example and not limitation, the most frequent content in an annual financial statement is financial tables, which are divided into a plurality of data elements cell by cell.
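The candidate search and the two preset conditions (score above a threshold, or top few by score) can be sketched in Python as follows. The record layout, the substring matching rule and the sample scores are assumptions made for the illustration.

```python
def find_candidates(records, target):
    """Keep records whose element or content mentions the target element."""
    needle = target.lower()
    return [
        r for r in records
        if needle in r["element"].lower() or needle in r["content"].lower()
    ]


def select_targets(candidates, threshold=None, top_k=None):
    """Apply the preset condition: score above threshold and/or top_k by score."""
    ranked = sorted(candidates, key=lambda r: r["score"], reverse=True)
    if threshold is not None:
        ranked = [r for r in ranked if r["score"] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


# Hypothetical structured data for one original document.
records = [
    {"element": "net profit", "content": "Net profit for the year", "score": 0.92},
    {"element": "statutory surplus reserve",
     "content": "10% of net profit is set aside", "score": 0.81},
    {"element": "registered capital",
     "content": "Registered capital of the company", "score": 0.40},
    {"element": "net profit margin", "content": "Margin by segment", "score": 0.35},
    {"element": "footnote", "content": "See the net profit table", "score": 0.10},
]

candidates = find_candidates(records, "net profit")
above_threshold = select_targets(candidates, threshold=0.5)
top_three = select_targets(candidates, top_k=3)
print([r["element"] for r in top_three])
# ['net profit', 'statutory surplus reserve', 'net profit margin']
```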
By way of example and not limitation, the "statutory surplus reserve" is a financial index: an enterprise must set aside 10% of its current net profit, after making up for losses, as statutory surplus reserve, and may stop doing so once the reserve reaches 50% of its registered capital. Through expert-knowledge injection, the statutory surplus reserve is therefore directly related to the enterprise's net profit and loss, and its relevance to a net profit and loss query is higher than that of other terms, which makes associative search easier for the user. When the statutory surplus reserve is searched, the original document is segmented by the word segmenter with expert knowledge added, vectors are computed with the Skip-gram method, and an intelligent search within the document or across documents is performed according to vector similarity; starting from one financial report, the most similar and most relevant occurrences of a financial index can thus be found in other financial documents.
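The similarity step can be illustrated with cosine similarity over term vectors. In the sketch below the three-dimensional vectors are hand-written, hypothetical stand-ins for vectors that a Skip-gram model would learn; the embodiment does not state which similarity measure is used, so cosine similarity is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, vectors, top_n=3):
    """Rank all other terms by similarity to the query term."""
    q = vectors[query]
    scored = [(term, cosine(q, vec)) for term, vec in vectors.items() if term != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical vectors; real ones would come from a trained Skip-gram model.
TERM_VECTORS = {
    "statutory surplus reserve": [0.9, 0.8, 0.1],
    "net profit": [0.8, 0.9, 0.2],
    "office address": [0.1, 0.0, 0.9],
}

ranking = most_similar("statutory surplus reserve", TERM_VECTORS)
print(ranking[0][0])  # net profit
```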
The second determining module is configured to determine the corresponding content in the original document according to the target structured data.
In this embodiment, the original document corresponding to the search object in the search request is determined, and its associated structured data is obtained. The target element is searched in the associated structured data to obtain the matching data element, and the structured data corresponding to that data element is taken as the target structured data. The corresponding content is then located in the original document according to the target structured data and output.
In a preferred embodiment, the data processing apparatus 1 may further include a verification unit.
The verification unit is configured to verify the original document and its associated structured data using regular expressions.
In this embodiment, the regular expressions may include number-checking expressions (e.g., positive numbers, negative numbers, decimals, non-negative integers, floating point numbers), character-checking expressions (e.g., strings of Chinese characters, strings of English letters and digits, strings of the 26 lower case letters, strings of the 26 upper case letters), and special-requirement expressions (e.g., e-mail address, domain name, mobile phone number, identification number, month format, account validity, date format).
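A few of the expression kinds listed above can be sketched with Python's `re` module. The specific patterns and the names in `PATTERNS` are illustrative assumptions, deliberately simplified; the embodiment does not give the actual expressions.

```python
import re

# Illustrative, simplified patterns; production rules would be stricter.
PATTERNS = {
    "non_negative_integer": re.compile(r"[0-9]+"),
    "decimal": re.compile(r"[+-]?[0-9]+\.[0-9]+"),
    "chinese_characters": re.compile(r"[\u4e00-\u9fa5]+"),
    "lowercase_letters": re.compile(r"[a-z]+"),
    "uppercase_letters": re.compile(r"[A-Z]+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date": re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"),
}


def validate(kind: str, value: str) -> bool:
    """Return True when the whole value matches the pattern of the given kind."""
    return PATTERNS[kind].fullmatch(value) is not None


print(validate("decimal", "-12.50"))   # True
print(validate("date", "2022-07-12"))  # True
print(validate("date", "2022-13-01"))  # False
```

`fullmatch` is used rather than `search` so that a value such as "12abc" is not accepted as an integer merely because it begins with digits.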
To structure and standardize financial statements, XBRL (eXtensible Business Reporting Language) technology was developed. XBRL is an Internet-based, cross-platform computer language dedicated to preparing, disclosing and using financial reports, and it largely achieves the integration and maximum utilization of data. However, historical financial documents recompiled with XBRL are difficult to proofread and verify against the original unstructured data, so the accuracy of the recompiled financial documents cannot be ensured.
The data processing device 1 of this embodiment not only structures the original document (a historical financial document) conveniently, which eases its dissemination and transcription, but also lets the user quickly navigate to and locate content in the original document in a linked manner, proofread differences between the structured form and the original document, and easily generate indexes such as a table catalogue. This greatly reduces manual work and remedies the current situation in which XBRL financial documents are disconnected from the original financial documents and cannot be verified against them. In addition, combining expert knowledge with NLP (Natural Language Processing) helps professional practitioners search for and locate relevant documents quickly, and helps enterprises improve service quality and customer satisfaction.
In this embodiment, the data processing apparatus 1 preprocesses the acquired original document through the processing unit 12 to obtain the data to be processed, and recognizes the data to be processed through the association unit 13 using a semantic model to obtain data elements and their associated information. It then generates the associated structured data corresponding to the original document from the associated information and the data elements. Because the associated structured data can be checked and proofread against the original document, the efficiency of data checking and the accuracy of the data are greatly improved.
In this embodiment, the data processing device 1 can recompile a large number of unstructured original documents (such as financial documents) into associated structured data that can be proofread against the unstructured originals, which reduces the cost of manual proofreading and improves the reliability and accuracy of the recompiled documents. The associated structured data is also easy to search and supports intelligent cross-document search, which improves usability and the user experience.
Example three
To achieve the above object, the present invention further provides a computer device 2. There may be a plurality of computer devices 2, and the components of the data processing apparatus 1 of the second embodiment may be distributed among different computer devices 2. The computer device 2 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster formed by a plurality of servers) that executes a program. The computer device 2 of the present embodiment includes at least, but is not limited to, a memory 21, a processor 23, a network interface 22, and the data processing apparatus 1, which can be communicatively connected to each other through a system bus (refer to fig. 4). It is noted that fig. 4 only shows the computer device 2 with some components; not all of the shown components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 2. The memory 21 may also comprise both an internal storage unit of the computer device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store the operating system and the various types of application software installed on the computer device 2, such as the program code of the data processing method of the first embodiment. The memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 23 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 23 is typically used to control the overall operation of the computer device 2, such as performing control and processing related to data interaction or communication of the computer device 2. In this embodiment, the processor 23 is configured to run the program code stored in the memory 21 or to process data, for example, to run the data processing apparatus 1.
The network interface 22 may comprise a wireless network interface or a wired network interface, and is generally used to establish a communication link between the computer device 2 and other computer devices 2. For example, the network interface 22 is used to connect the computer device 2 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
It is noted that fig. 4 only shows the computer device 2 with components 21-23; not all of the shown components are required, and more or fewer components may be implemented instead.
In this embodiment, the data processing apparatus 1 stored in the memory 21 may be further divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 23) to complete the present invention.
Example four
To achieve the above objects, the present invention also provides a computer-readable storage medium including a plurality of storage media such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor 23, implements corresponding functions. The computer readable storage medium of the embodiment is used for storing the data processing apparatus 1, and when being executed by the processor 23, the computer readable storage medium implements the data processing method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring an original document;
preprocessing the original document to obtain data to be processed;
and identifying data to be processed by adopting a semantic model to obtain data elements and associated information corresponding to the data elements, and generating associated structured data corresponding to the original document according to the associated information and the data elements.
2. The data processing method of claim 1, wherein the preprocessing the original document to obtain data to be processed comprises:
and recognizing the original document by an optical character recognition (OCR) method to obtain the data to be processed.
3. The data processing method of claim 1, wherein the identifying the data to be processed by using the semantic model to obtain a data element and associated information corresponding to the data element, and generating associated structured data corresponding to the original document according to the associated information and the data element comprises:
performing word segmentation recognition on the data to be processed according to words in a financial dictionary by adopting a semantic model to obtain the data elements and the associated scores of the data elements;
determining the associated information of the data elements in the original document according to the positions, the types and the associated scores of the data elements in the original document;
generating associated structured data corresponding to the original document based on the data elements and the associated information corresponding to the data elements, wherein the associated structured data comprises at least one piece of structured data, each piece of structured data corresponds to one data element, and each piece of structured data comprises the data element together with its paragraph position, paragraph type, page number and association score in the original document.
4. The data processing method of claim 3, further comprising:
receiving a search request, wherein the search request comprises a target element and a search object;
according to a search object in the search request, determining an original document corresponding to the search object;
searching the associated structured data corresponding to the original document according to the target elements in the search request, and acquiring the target structured data corresponding to the target elements;
and determining corresponding content in the original document according to the target structured data.
5. The data processing method of claim 4, wherein the search object corresponds to one of the original documents, or the search object corresponds to a plurality of the original documents.
6. The data processing method according to claim 4, wherein the searching the associated structured data corresponding to the original document according to the target element in the search request to obtain the target structured data corresponding to the target element comprises:
searching all structured data in the associated structured data corresponding to the original document according to a target element in the search request, and acquiring candidate structured data matched with the target element;
and determining the target structured data matched with the target elements according to the obtained association scores of the data elements in the candidate structured data.
7. The data processing method of claim 1, wherein the original document and the associated structured data corresponding to the original document are verified using regular expressions.
8. A data processing apparatus, characterized by comprising:
an acquisition unit configured to acquire an original document;
the processing unit is used for preprocessing the original document to obtain data to be processed;
and the association unit is used for identifying data to be processed by adopting a semantic model to obtain a data element and associated information corresponding to the data element, and generating associated structured data corresponding to the original document according to the associated information and the data element.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210817671.1A 2022-07-12 2022-07-12 Data processing method and device, computer equipment and storage medium Pending CN115098642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817671.1A CN115098642A (en) 2022-07-12 2022-07-12 Data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817671.1A CN115098642A (en) 2022-07-12 2022-07-12 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115098642A true CN115098642A (en) 2022-09-23

Family

ID=83295941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817671.1A Pending CN115098642A (en) 2022-07-12 2022-07-12 Data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115098642A (en)

Similar Documents

Publication Publication Date Title
WO2019153607A1 (en) Intelligent response method, electronic device and storage medium
CN108170715B (en) Text structuralization processing method
US9141853B1 (en) System and method for extracting information from documents
CN104834651B (en) Method and device for providing high-frequency question answers
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
US9098487B2 (en) Categorization based on word distance
CN113495900A (en) Method and device for acquiring structured query language sentences based on natural language
US10482323B2 (en) System and method for semantic textual information recognition
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
CN110362798B (en) Method, apparatus, computer device and storage medium for judging information retrieval analysis
CN113515629A (en) Document classification method and device, computer equipment and storage medium
CN114298035A (en) Text recognition desensitization method and system thereof
CN110597978A (en) Article abstract generation method and system, electronic equipment and readable storage medium
EP4141818A1 (en) Document digitization, transformation and validation
CN111506595B (en) Data query method, system and related equipment
US20230028664A1 (en) System and method for automatically tagging documents
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN113297852B (en) Medical entity word recognition method and device
CN109684357B (en) Information processing method and device, storage medium and terminal
WO2021042529A1 (en) Article abstract automatic generation method, device, and computer-readable storage medium
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN112818005B (en) Structured data searching method, device, equipment and storage medium
CN115098642A (en) Data processing method and device, computer equipment and storage medium
CN110909538B (en) Question and answer content identification method and device, terminal equipment and medium
CN115099213A (en) Information processing method and information processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination