CN111930805B - Information mining method and computer equipment - Google Patents


Publication number
CN111930805B
Authority
CN
China
Prior art keywords
keyword
label
report file
periodic report
information mining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010797241.9A
Other languages
Chinese (zh)
Other versions
CN111930805A (en
Inventor
吴智炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010797241.9A
Publication of CN111930805A
Application granted
Publication of CN111930805B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data processing and provides an information mining method, an information mining apparatus, a computer device and a computer readable storage medium. In the information mining method, text decomposition is performed on a periodic report file, and a pre-built dictionary is used to identify the content of the resulting text fragment set to obtain a keyword set. Because the trained classifier describes the correspondence between keywords and type labels, performing type recognition on each keyword in the keyword set with the trained classifier matches a corresponding type label to each keyword. Because the type labels distinguish the meanings of the keywords in the keyword set, an information mining result that represents the characteristics of the periodic report file can be output based on the type label of each keyword, which provides a scheme for mining information from periodic report files. The scheme of the application can also be applied to the blockchain field.

Description

Information mining method and computer equipment
Technical Field
The present invention relates to data processing and blockchain technologies, and in particular, to an information mining method, an information mining apparatus, a computer device, and a computer readable storage medium.
Background
Nowadays, in order to promote technological innovation, more and more fields discover user demands by deeply analyzing existing user information or data, and then develop the products that consumers seek.
In existing information analysis technology, user data is mostly collected statistically to obtain a large amount of user information, and the relevant user demands are determined by analyzing that information and are used as a reference or guide for product research and development. In conventional information analysis schemes, the user data used is objective and free of human intervention, for example the operation data generated when a user browses products online or the types of products the user browses online. However, existing information analysis methods cannot analyze or make use of data or information that is not objectively summarized, such as a user's evaluation of a product, the user's experience of using it, or the user's description of their own work content. The application range of information mining schemes in existing information analysis technology is therefore small.
Disclosure of Invention
In view of the above, the embodiments of the present application provide an information mining method, an information mining apparatus, a computer device, and a computer readable storage medium, so as to solve the problem that in the existing information analysis technology, the application range of the information mining scheme is smaller.
A first aspect of an embodiment of the present application provides an information mining method, including:
performing text decomposition on a periodic report file to obtain a text fragment set;
identifying the content in the text fragment set by using a pre-built dictionary to obtain a keyword set;
performing type recognition on each keyword in the keyword set through a trained classifier, and matching a corresponding type label to each keyword; and
outputting an information mining result based on the type label of each keyword.
A second aspect of an embodiment of the present application provides an information mining apparatus, including:
a decomposing unit, configured to perform text decomposition on a periodic report file to obtain a text fragment set;
a first recognition unit, configured to identify the content in the text fragment set by using a pre-built dictionary to obtain a keyword set;
a second recognition unit, configured to perform type recognition on each keyword in the keyword set through a trained classifier, and to match a corresponding type label to each keyword; and
an output unit, configured to output an information mining result based on the type label of each keyword.
A third aspect of the embodiments of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the first aspect.
The information mining method, the information mining device, the computer equipment and the computer readable storage medium provided by the embodiment of the application have the following beneficial effects:
In the embodiments of the application, text decomposition is performed on the periodic report file, and a pre-built dictionary is used to identify the content of the resulting text fragment set to obtain a keyword set. Because the trained classifier describes the correspondence between keywords and type labels, performing type recognition on each keyword in the keyword set with the trained classifier matches a corresponding type label to each keyword. Because the type labels distinguish the meanings of the keywords in the keyword set, an information mining result that represents the characteristics of the periodic report file can be output based on the type label of each keyword. This provides a scheme for mining information from periodic report files and expands the application range of information mining schemes.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an implementation of an information mining method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of an information mining method according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating an implementation of an information mining method according to still another embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of an information mining method according to another embodiment of the present application;
FIG. 5 is a block diagram of an information mining apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the embodiments of the application, text decomposition is performed on the periodic report file to obtain a text fragment set, and a pre-built dictionary is then used to identify the content of the text fragment set to obtain a keyword set. Because the trained classifier describes the correspondence between keywords and type labels, performing type recognition on each keyword in the keyword set with the trained classifier matches a corresponding type label to each keyword. Because the type labels distinguish the meanings of the keywords in the keyword set, an information mining result that represents the characteristics of the periodic report file can be output based on the type label of each keyword. This provides a scheme for mining information from periodic report files and expands the application range of information mining schemes.
In the embodiments of the present application, the flow is executed by a server, which includes but is not limited to a device capable of executing the information mining method, such as a computer, a smart phone or a tablet computer. FIG. 1 shows a flowchart of an implementation of the information mining method according to the first embodiment of the present application, which is described in detail below:
S11: Perform text decomposition on the periodic report file to obtain a text fragment set.
In step S11, the content of the periodic report file is edited by the user and describes the user's work content or schedule within a certain period, for example the development progress of a project, the sales of a product, or a summary of the schedule of a business activity.
Before the server performs text decomposition on the periodic report file, the file may be uploaded, through a terminal logged in with the user ID, to a preset server used for storing periodic report files. When information mining is to be performed on the periodic report file, the server sends a file acquisition request carrying the user ID to the preset server, and the preset server retrieves the corresponding periodic report file according to the user ID and sends the file to the server. Alternatively, the preset server synchronizes the stored periodic report files to the server according to a preset information synchronization policy. As a further alternative, a data upload interface shared by the server and the preset server is configured, and the terminal logged in with the user ID uploads the periodic report file to both the server and the preset server through the shared interface.
In this embodiment, the content of the periodic report file may include formatted content such as charts and text. Text decomposition may be performed on this content directly, or the text content in the charts may be recognized first; in either case, the continuous character sequences in the periodic report file are recombined into word sequences according to a certain specification. Specifically, an existing text word segmentation strategy may be used to remove the punctuation marks contained in the text content and to distinguish the words and phrases in it, so that the word sequences are recognized and the whole text is decomposed. Alternatively, the sentence-breaking punctuation marks in the text content are identified first; these may include commas, semicolons, periods, exclamation marks, question marks and the like. The text content is then segmented into a sentence set according to the identified punctuation marks, the punctuation marks contained in each sentence are removed before its words and phrases are distinguished, and finally vocabulary recognition is performed on all sentences to complete the decomposition of the whole text.
As a possible implementation, the periodic report file is a plain text file, and step S11 specifically includes:
determining the sentence-breaking punctuation marks in the periodic report file and the position information corresponding to each punctuation mark;
performing sentence breaking on the content of the periodic report file based on the position information corresponding to each punctuation mark to obtain a plurality of original text fragments; and
removing the meaningless words and the sentence-breaking punctuation marks from each original text fragment to obtain the text fragment set.
In this embodiment, to make the text fragment set more useful, text decomposition of the user's periodic report file removes not only punctuation marks but also words that appear in the text without conveying any distinguishable content. Examples are words expressing uncertainty, such as "ok", "if" and "perhaps", and modal particles, such as "o", "y" and the like.
It should be understood that the text fragment set obtained by decomposing the periodic report file contains a plurality of text fragments, and that the original fragments contain punctuation marks. To avoid misidentifying punctuation marks as words or phrases, after the text of the periodic report file is decomposed into a plurality of sentences, the punctuation marks in each sentence are removed before words and phrases are distinguished, so that the word sequences are recognized.
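The plain-text variant of step S11 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the names `SENTENCE_BREAKS`, `MEANINGLESS_WORDS`, `find_breaks` and `decompose` are hypothetical, the word list is an assumed example, and words are separated with a simple regular expression instead of a real word segmentation strategy.

```python
import re

# Sentence-breaking punctuation named in the description (ASCII and full-width forms).
SENTENCE_BREAKS = ",;.!?，；。！？"
# Hypothetical list of words that carry no distinguishable content.
MEANINGLESS_WORDS = {"if", "perhaps", "maybe"}


def find_breaks(text):
    """Return (position, mark) pairs for every sentence-breaking punctuation mark."""
    return [(i, ch) for i, ch in enumerate(text) if ch in SENTENCE_BREAKS]


def decompose(text):
    """Split text at the break positions, then drop punctuation and meaningless words."""
    fragments = []
    start = 0
    # A sentinel at the end of the text closes the last fragment.
    for pos, _mark in find_breaks(text) + [(len(text), "")]:
        raw = text[start:pos]  # original text fragment without the break mark
        start = pos + 1
        words = [w for w in re.findall(r"\w+", raw) if w.lower() not in MEANINGLESS_WORDS]
        if words:
            fragments.append(" ".join(words))
    return fragments


report = "Finished the A product demo, perhaps sales will grow; total sales reached 300."
print(decompose(report))
# ['Finished the A product demo', 'sales will grow', 'total sales reached 300']
```

Because every period is treated as a sentence break, this sketch would also split decimal numbers such as "3.5"; a production word segmenter would need to handle that case.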
In practical applications, the content of the periodic report file describes the user's work content or work results within a certain period, for example a weekly, monthly, quarterly or annual work report file. The only sentence-ending punctuation marks in such a file are therefore commas, semicolons and periods; question marks, exclamation marks and the like do not appear.
It should be understood that the content of the periodic report file is the content to be mined. Decomposing the periodic report file into a text fragment set refines the granularity of that content, which provides the basis for information mining.
S12: Identify the content in the text fragment set by using a pre-built dictionary to obtain a keyword set.
In S12, the pre-built dictionary is constructed based on corpus samples. The keywords are obtained by using the pre-built dictionary to perform keyword recognition on the content in the text fragment set.
It should be noted that the pre-built dictionary describes the correspondence between keywords and their usage rates, and based on this correspondence the dictionary can recognize keywords in the text fragments. Because different keywords have different usage rates, the keywords in the keyword set output by the pre-built dictionary can be ordered by usage rate.
In practical applications, the corpus samples may be obtained by splitting previous periodic report files into paragraphs or sentences. These paragraphs or sentences are input as corpus samples into a dictionary construction tool, which constructs the dictionary from them and outputs an ordered dictionary together with the word segmentation structure of the corpus samples; this word segmentation structure is consistent with the one described by the dictionary.
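One way to picture a dictionary that maps keywords to usage rates is sketched below. The description does not name the dictionary construction tool, so everything here is an assumption made for illustration: the functions `ngrams`, `build_dictionary` and `recognize_keywords`, the definition of usage rate as the share of corpus samples containing a phrase, and the `max_len` and `min_count` thresholds.

```python
from collections import Counter


def ngrams(words, max_len):
    """Yield every run of 1..max_len consecutive words as a phrase."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_dictionary(corpus_samples, max_len=2, min_count=2):
    """Map each candidate phrase to its usage rate (share of samples containing it)."""
    counts = Counter()
    for sample in corpus_samples:
        # set() so that a phrase is counted once per sample
        counts.update(set(ngrams(sample.lower().split(), max_len)))
    total = len(corpus_samples)
    return {p: c / total for p, c in counts.items() if c >= min_count}


def recognize_keywords(fragments, dictionary, max_len=2):
    """Return dictionary phrases found in the fragments, highest usage rate first."""
    found = set()
    for fragment in fragments:
        found.update(p for p in ngrams(fragment.lower().split(), max_len) if p in dictionary)
    return sorted(found, key=lambda p: (-dictionary[p], p))


corpus = ["sales index reached", "sales index met", "team sales grew"]
dictionary = build_dictionary(corpus)
print(recognize_keywords(["the sales index improved"], dictionary))
# ['sales', 'index', 'sales index']
```

Ties in usage rate are broken alphabetically so that the ordering of the keyword set is deterministic.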
S13: Perform type recognition on each keyword in the keyword set through a trained classifier, and match a corresponding type label to each keyword.
In step S13, the trained classifier is used to describe the correspondence between each keyword in the keyword set and the type label. Type labels are used to distinguish meaning between multiple keywords in a keyword set.
In this embodiment, the type label may include "department", "project", "function", "product", "index", "activity", "honor" and/or "performance". For example, suppose the first keyword is "sales department business", the second keyword is "a product", the third keyword is "sales index", the fourth keyword is "most potential employee" and the fifth keyword is "sales total xxx ten thousand". Then the type label of the first keyword is "department" and/or "function", the type label of the second keyword is "project" and/or "product", the type label of the third keyword is "index", the type label of the fourth keyword is "activity" and/or "honor", and the type label of the fifth keyword is "performance".
It should be noted that, in all embodiments of the present application, the classifier is obtained by constructing a classification model based on existing data. The classifier may use algorithms such as decision trees, logistic regression, naive Bayes or neural networks, and maps data records in a database to one of a set of given categories, so it can be applied to the field of information mining. Training the classifier with a pre-configured sample set yields the trained classifier. The trained classifier performs semantic recognition on each keyword in the keyword set and maps the keyword to one of the given type labels, that is, it matches the corresponding type label based on the semantic recognition result of the keyword.
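Naive Bayes is one of the algorithms the description lists for the classifier. The sketch below shows how such a classifier could map keywords to type labels; it is an assumption-laden illustration, not the classifier of the application. The class `NaiveBayesLabeler` and the training pairs are hypothetical, and word-token statistics stand in for the semantic recognition mentioned in the text.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLabeler:
    """Multinomial naive Bayes over word tokens with add-one smoothing."""

    def fit(self, samples):
        # samples: list of (keyword, type_label) pairs from a pre-configured sample set
        self.label_counts = Counter(label for _, label in samples)
        self.token_counts = defaultdict(Counter)
        for keyword, label in samples:
            self.token_counts[label].update(keyword.lower().split())
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, keyword):
        tokens = keyword.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each token
            score = math.log(self.label_counts[label] / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training pairs; real samples would come from labelled report files.
samples = [
    ("sales department", "department"), ("marketing department", "department"),
    ("sales index", "index"), ("quality index", "index"),
    ("a product", "product"), ("b product", "product"),
]
labeler = NaiveBayesLabeler().fit(samples)
print(labeler.predict("finance department"))  # department
print(labeler.predict("sales index"))         # index
```

This sketch assigns exactly one label per keyword, whereas the example in the text allows a keyword to carry more than one type label.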
S14: Output an information mining result based on the type label of each keyword.
In step S14, the information mining result describes the content obtained by mining the periodic report file. The information mining result may be presented within the content of the original periodic report file, or an information mining result file associated with the periodic report file may be generated separately.
In this embodiment, the way the information mining result is presented may be configured or adjusted according to actual requirements. However it is presented, the information mining result represents the key content or characteristics of the periodic report file.
As an example, when the information mining result is presented within the content of the original periodic report file, the type label matched to each keyword is used as annotation content: the position of each keyword is identified in the periodic report file, and the corresponding annotation content is displayed at that position. Displaying the annotation content at the position of each keyword makes the key points stand out in the periodic report file and improves its readability.
As another example, when the information mining result is a separately generated file associated with the periodic report file, the type labels of the keywords may be configured into a preset file template to obtain a mining result file. The file template may be a script template preconfigured according to the application requirement of the information mining result. Because the object of information mining is a periodic report file whose content describes the user's work content or work results within a certain period, the application requirement may be one configured based on that work content or those work results, for example a script template for a video resource pushing requirement or a script template for team performance integration. When the type labels are configured into the preset file template, the resulting mining result may be a target script file, with which the information mining result can be applied directly to actual service requirements, improving the utilization rate of the information mining result.
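The in-place presentation can be illustrated with a short sketch that locates each keyword in the report text and appends its type label as an annotation. The function `annotate` and the square-bracket annotation format are assumptions chosen for illustration; the description does not prescribe a display format.

```python
import re


def annotate(report_text, keyword_labels):
    """Append the type label in square brackets after each keyword occurrence."""
    if not keyword_labels:
        return report_text
    # Longest keywords first so that a longer phrase wins over a shorter one inside it.
    ordered = sorted(keyword_labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))

    def add_label(match):
        keyword = match.group(0)
        return f"{keyword}[{keyword_labels[keyword]}]"

    return pattern.sub(add_label, report_text)


labels = {"sales department": "department", "sales index": "index"}
print(annotate("The sales department met the sales index.", labels))
# The sales department[department] met the sales index[index].
```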
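Filling type labels into a preset file template could look like the sketch below. The template text `PUSH_TEMPLATE`, its field names and the function `render_result` are hypothetical; the description only says that a script template is preconfigured for the application requirement.

```python
from string import Template

# Hypothetical script template for a video resource pushing requirement.
PUSH_TEMPLATE = Template("user=$user_id\nlabels=$labels\naction=push_video")


def render_result(user_id, keyword_labels):
    """Fill the template with the distinct type labels, sorted for stable output."""
    labels = ",".join(sorted(set(keyword_labels.values())))
    return PUSH_TEMPLATE.substitute(user_id=user_id, labels=labels)


result = render_result("u001", {"sales department": "department", "A product": "product"})
print(result)
# user=u001
# labels=department,product
# action=push_video
```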
As a possible implementation, step S14 may specifically include:
performing label configuration on the content in the periodic report file based on the type label of each keyword by using a trained label configuration model, to obtain a new periodic report file.
In this embodiment, the trained label configuration model describes the correspondence between target labels and type labels. Type labels distinguish the meanings of the keywords in the keyword set. The target label differs from the type label: it distinguishes numeric labels from non-numeric labels among the type labels. The keyword corresponding to a numeric label is numeric content, and numeric labels may be "performance", "index" and the like. The keyword corresponding to a non-numeric label is non-numeric content, and non-numeric labels may be "department", "project", "function", "product", "activity", "honor" and the like.
In practical applications, the label configuration model may be an original model constructed based on the BERT sequence labeling model, and the trained label configuration model is obtained by labeling training data and training the original model.
Before the original model is trained, a regular expression between the target label and the type label is constructed, and regular matching is performed on the type labels recognized in the sample corpus, so that the target label corresponding to a numeric type label is a numeric value and the target label corresponding to a non-numeric type label is a keyword. The data obtained by regular matching is then used to train the original model, which yields the trained label configuration model.
In practical applications, constructing the regular expression between the target label and the type label means first mining the semantics that relate the target label to the type label and then configuring the regular expression based on those semantics.
It can be understood that, while the regular expression between the target label and the type label is being constructed, it can be adjusted according to its matching accuracy, so that the final regular expression fully describes the correspondence between the target label and the type label.
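The regular matching that prepares training data for the label configuration model can be sketched as follows. Only the data preparation is shown; training the BERT-based model is outside this sketch. The pattern `NUMBER` and the function `target_label` are hypothetical, and the set `NUMERIC_LABELS` contains only the two numeric labels named in the text.

```python
import re

NUMERIC_LABELS = {"performance", "index"}  # numeric type labels named in the text
NUMBER = re.compile(r"\d+(?:\.\d+)?")      # hypothetical pattern for a numeric value


def target_label(keyword, type_label):
    """Return the numeric value for numeric type labels, else the keyword itself."""
    if type_label in NUMERIC_LABELS:
        match = NUMBER.search(keyword)
        if match:
            return float(match.group(0))
    return keyword


print(target_label("total sales 300 ten thousand", "performance"))  # 300.0
print(target_label("sales department", "department"))               # sales department
```

A keyword with a numeric type label but no digits, such as "sales index", falls back to the keyword itself; adjusting the pattern for such cases corresponds to tuning the regular expression for accuracy.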
In the above scheme, text decomposition of the periodic report file removes interference, namely punctuation marks and content from which no information can be mined. This is a first screening of the corpus and yields a text fragment set with finer granularity. Because the pre-built dictionary is constructed based on the word length and word frequency of seed words, identifying the content of the text fragment set with the dictionary yields a keyword set that is worth mining. Classifying the keyword set with the classifier is a second screening and yields a plurality of keyword categories. Because the trained classifier describes the correspondence between each keyword in the keyword set and the meaning it represents, classifying the keyword set with the classifier characterizes each keyword in the periodic report file, so the information mining result can be output based on the keyword categories and information mining is completed. This provides a scheme for mining information from periodic report files and expands the application range of information mining schemes. In addition, the application range and reuse rate of periodic report files are also expanded.
FIG. 2 shows a flowchart of an implementation of an information mining method according to another embodiment of the present application. Referring to FIG. 2, compared with the embodiment shown in FIG. 1, the information mining method provided in this embodiment further includes steps S21 to S22 before the step of performing text decomposition on the periodic report file to obtain a text fragment set. The details are as follows:
S21: Send a request for acquiring the periodic report file to a preset server, where the preset server is configured to receive the periodic report file uploaded by a terminal logged in with a user ID and to store the periodic report file in association with the user ID.
S22: Receive the periodic report file and the corresponding user ID returned by the preset server in response to the request.
In this embodiment, the server sends the request for acquiring the periodic report file to the preset server. The preset server is different from the server and contains a database whose information describes the correspondence between periodic report files and user IDs. The user ID identifies the owner of each periodic report file and is also the unique identifier with which the user interacts, through the terminal, with the preset server and with the server.
In practical applications, the content of the periodic report file is edited by the user and describes the user's work content or schedule within a certain period. To keep the correspondence between the periodic report file and the user unchanged, the user uploads the periodic report file to the preset server through a terminal logged in with the user ID, and the preset server stores the periodic report file in association with the user ID.
It should be noted that, in scenarios with a large amount of information, the storage of different information has to take regional division into account, and periodic report files are stored in the libraries of a plurality of different institutions. The periodic report files therefore need to be extracted to a server for information mining.
In practical applications, a large number of periodic report files can be synchronized to the server for information mining through the synchronization tool Kafka. To prevent data loss during synchronization, the leader epoch algorithm may be used. Using the leader epoch algorithm for data synchronization is an existing technical means, so it is not described in detail here.
In all embodiments of the present application, when information mining has to be provided for a large number of users, data mining is separated from data collection. A preset server is set up to receive the periodic report files uploaded by terminals logged in with user IDs and to store each periodic report file in association with its user ID. When information mining is required, the server sends a request for acquiring the periodic report file to the preset server and receives the periodic report file and the corresponding user ID that the preset server returns in response. The server thus obtains the periodic report file and the corresponding user ID from the preset server and provides the execution environment and execution resources for information mining. This avoids overloading the preset server with access requests when there are many information mining requests, and it also prevents the content of the periodic report file from being tampered with as a result of mining the user-uploaded file directly, so the integrity of the source data is ensured.
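The separation between data collection and data mining in steps S21 and S22 can be modelled with two small in-memory classes. This is only a toy illustration: `PresetServer` and `MiningServer` are hypothetical names, a dictionary stands in for the database, and direct method calls stand in for network requests.

```python
class PresetServer:
    """In-memory stand-in for the preset server that stores uploads by user ID."""

    def __init__(self):
        self._files = {}

    def upload(self, user_id, report_text):
        # Store the periodic report file in association with the user ID.
        self._files.setdefault(user_id, []).append(report_text)

    def handle_request(self, user_id):
        # Return a copy so the mining side cannot alter the stored source data.
        return user_id, list(self._files.get(user_id, []))


class MiningServer:
    """Stand-in for the server that performs information mining."""

    def __init__(self, preset_server):
        self.preset_server = preset_server

    def fetch_reports(self, user_id):
        """S21/S22: request the files and receive them together with the user ID."""
        return self.preset_server.handle_request(user_id)


preset = PresetServer()
preset.upload("u001", "Week 1 report")
mining = MiningServer(preset)
user_id, reports = mining.fetch_reports("u001")
reports.append("modified copy")            # changes only the mining-side copy
print(preset.handle_request("u001")[1])    # ['Week 1 report']
```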
Fig. 3 is a flowchart illustrating an implementation of an information mining method according to still another embodiment of the present application. Referring to fig. 3, with respect to the embodiment shown in fig. 2, in the information mining method provided in this embodiment, after the step of outputting the information mining result based on the type tag of each keyword, the method further includes S31 to S32, which are described in detail as follows:
S31: Generating a resource pushing strategy based on the information mining result.
S32: After the resource pushing strategy is associated with the user ID, packaging and sending the resource pushing strategy to a target server, so that the target server recommends corresponding resources to the terminal logged in with the user ID according to the resource pushing strategy.
In this embodiment, the resource pushing strategy describes the manner in which the target server recommends resources to the terminal logged in with the user ID. The information mining result includes at least information reflecting the user classification corresponding to the user ID, such as a type tag. Accordingly, the resource pushing strategy can be generated from the type tags. Because the type tags distinguish the meanings of the keywords in the keyword set, they can also distinguish the resource requirements of different user IDs.
It should be noted that, unlike the server or the preset server, the target server is a server that stores resources. Here, the resource may be a video resource, a text resource, etc., and the target server is a server for storing the video resource or the text resource. Taking a video resource as an example, when the information mining result is a type tag, the type tag can comprise at least one of a department, an item, a function, a product, an index, an activity, an honor and a performance; the video resource is screened from a video database based on the type tag and is pushed to the terminal logged in with the user ID.
In practical application, the server executing the information mining method provided by all embodiments of the present application shares the user database with the preset server and the target server, that is, different application programs are logged in on the terminal by using the same user ID, and data in the server, the preset server and the target server are accessed respectively through the different application programs.
In the scheme of this embodiment, the server generates the resource pushing strategy based on the information mining result, associates the resource pushing strategy with the user ID, and then packages and sends it to the target server, so that the target server recommends corresponding resources to the terminal logged in with the user ID according to the resource pushing strategy. After information mining has been performed on the periodic report file, the target server can therefore recommend resource content that better matches the user's actual situation. In other words, the information mining result is used to optimize both the pushing mode and the pushed content of the target server, and the utilization rate of the information mining result is improved at the same time.
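As an illustration of S31 and S32, the following minimal sketch derives a pushing strategy from the type tags in a mining result and uses it to screen a resource catalog. The function names, the dictionary layout of the strategy, and the rule "prefer the most frequent type tags" are assumptions made for this example; the application does not specify how the strategy is computed.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_push_policy(user_id: str,
                      mining_result: List[Tuple[str, str]],
                      top_n: int = 2) -> Dict[str, object]:
    """Build a hypothetical push policy from (keyword, type_tag) pairs."""
    tag_counts = Counter(tag for _, tag in mining_result)
    preferred = [tag for tag, _ in tag_counts.most_common(top_n)]
    # The policy is associated with the user ID before being sent on.
    return {"user_id": user_id, "preferred_tags": preferred}


def recommend(policy: Dict[str, object], catalog: List[dict]) -> List[str]:
    """Target-server side: screen resources whose tags overlap the policy."""
    wanted = set(policy["preferred_tags"])
    return [item["title"] for item in catalog if wanted & set(item["tags"])]


mining_result = [
    ("Q3 sales", "performance"), ("premium income", "performance"),
    ("renewal rate", "index"), ("persistency", "index"), ("conversion", "index"),
    ("team award", "honor"),
]
catalog = [
    {"title": "Reading index dashboards", "tags": ["index"]},
    {"title": "Award ceremony recap", "tags": ["honor"]},
    {"title": "Closing techniques", "tags": ["performance", "product"]},
]
policy = build_push_policy("user-001", mining_result)
titles = recommend(policy, catalog)
```

A real deployment would serialize the policy and send it to the target server over the network; here both sides run in one process for brevity.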
As one embodiment, after the step of outputting the information mining result based on the type tag of each keyword, the method further includes: uploading the information mining result to a blockchain node.
In this embodiment, in order to share the result of information mining on the periodic report file, the information mining result is uploaded to the blockchain, so as to prevent the information mining result corresponding to the periodic report file from being tampered with.
In all embodiments of the present application, the information mining result is output based on the type tag of each keyword, and uploading the information mining result to the blockchain ensures its security as well as fairness and transparency to the user. In actual use, the information mining result may be downloaded from the blockchain in order to verify whether it has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, each of which contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
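The tamper-evidence property mentioned above can be shown with a minimal hash-chain sketch. This is not a blockchain node implementation: there is no network, consensus, or signature handling, and the block layout and function names are assumptions for illustration. It only demonstrates how linking each block to the hash of its predecessor lets a later reader detect a modified mining result.

```python
import hashlib
import json
from typing import List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


def block_hash(index: int, prev_hash: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps(
        {"index": index, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[dict], payload: dict) -> dict:
    """Append a mining result as a new block linked to the previous one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    block = {"index": index, "prev_hash": prev_hash, "payload": payload,
             "hash": block_hash(index, prev_hash, payload)}
    chain.append(block)
    return block


def verify_chain(chain: List[dict]) -> bool:
    """Recompute every hash and link; any edited payload breaks verification."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, prev_hash, block["payload"]):
            return False
        prev_hash = block["hash"]
    return True


chain: List[dict] = []
append_block(chain, {"user_id": "user-001", "tags": {"renewal rate": "index"}})
append_block(chain, {"user_id": "user-002", "tags": {"team award": "honor"}})
ok_before = verify_chain(chain)
chain[0]["payload"]["tags"]["renewal rate"] = "honor"  # simulated tampering
ok_after = verify_chain(chain)
```

After the simulated edit, verification fails because the stored hash of the first block no longer matches its contents.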
Fig. 4 is a flowchart illustrating an implementation of an information mining method according to another embodiment of the present application. Referring to fig. 4, with respect to any one of the embodiments of fig. 1 to 3, in the information mining method provided in this embodiment, before the step of identifying the content in the text segment set by using a pre-constructed dictionary to obtain a keyword set, the method further includes S41, which is described in detail as follows:
S41: Constructing a dictionary from a preset corpus sample by using a dictionary construction tool.
In step S41, the dictionary construction tool provides the user with a configuration interface for the dictionary construction strategy. Through this interface the user configures the recognition strategy, and the tool constructs the dictionary from the content of the corpus sample according to that recognition strategy.
In practice, the dictionary construction tool may be built using the automatic phrase mining framework AutoPhrase and the new word discovery algorithm TopWORDS. When the dictionary construction tool builds a dictionary from the input corpus sample, a word length threshold and a word frequency threshold must be considered. After the corpus sample is input into the dictionary construction tool, words whose length is smaller than the word length threshold and whose occurrence frequency is equal to or greater than the word frequency threshold are obtained from the corpus sample, based on the recognition strategy configured in the tool, to form an original dictionary. After numerical conversion is performed on the content of the original dictionary, the usage rate of each word is calculated using the EM algorithm, and words with lower usage rates are removed to obtain the final dictionary.
In this embodiment, the content in the text segment set is identified by using the pre-constructed dictionary, taking into account the length of each word in the text segments and its occurrence frequency. That is, words in the text segment set that have a suitable length and a high occurrence frequency are taken as keywords, so as to obtain the keyword set.
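The threshold-based filtering that produces the original dictionary can be sketched as follows. This is a simplified illustration, not AutoPhrase or TopWORDS: candidates are whitespace-token n-grams, "word length" is measured in tokens (for Chinese text it would more naturally be measured in characters), the function names and default thresholds are assumptions, and the subsequent EM-based usage-rate pruning is omitted.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def candidate_phrases(tokens: List[str], length_threshold: int) -> Iterator[str]:
    """Yield every n-gram whose length is strictly below the length threshold."""
    for n in range(1, length_threshold):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_original_dictionary(corpus: Iterable[str],
                              length_threshold: int = 3,
                              freq_threshold: int = 2) -> Dict[str, int]:
    """Keep candidates with length < length_threshold and count >= freq_threshold."""
    counts: Counter = Counter()
    for line in corpus:
        counts.update(candidate_phrases(line.lower().split(), length_threshold))
    return {phrase: c for phrase, c in counts.items() if c >= freq_threshold}


corpus = [
    "renewal rate improved",
    "renewal rate dropped",
    "team meeting held",
]
original_dictionary = build_original_dictionary(corpus)
```

With these sample sentences, only "renewal", "rate", and "renewal rate" reach the frequency threshold; three-token phrases are never generated because the length must be smaller than the threshold of 3.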
It should be understood that, in this embodiment, when the content in the text segment set is identified by using the pre-constructed dictionary, the length and the frequency of the vocabulary appearing in each text segment in the text segment set are considered, so that the vocabulary capable of being mined in detail can be effectively identified from the text segments as the keywords, and the efficiency of identifying the keywords from the text segment set is improved.
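One way to identify keywords in text segments with a pre-constructed dictionary is greedy longest-match lookup, sketched below. The application does not state which matching procedure is used, so the forward longest-match rule, the token-based segmentation, and the function name `extract_keywords` are all assumptions for illustration.

```python
from typing import Iterable, List, Set


def extract_keywords(segments: Iterable[str],
                     dictionary: Set[str],
                     max_len: int = 3) -> List[str]:
    """Scan each segment left to right, preferring the longest dictionary entry."""
    keywords: List[str] = []
    for segment in segments:
        tokens = segment.lower().split()
        i = 0
        while i < len(tokens):
            # Try the longest window first, then shrink it.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + n])
                if phrase in dictionary:
                    if phrase not in keywords:
                        keywords.append(phrase)
                    i += n
                    break
            else:
                i += 1  # no dictionary entry starts at this token
    return keywords


dictionary = {"renewal rate", "renewal", "team award"}
segments = ["renewal rate rose this month", "won team award", "renewal notice sent"]
keyword_set = extract_keywords(segments, dictionary)
```

Preferring the longest match keeps "renewal rate" intact in the first segment, while the shorter entry "renewal" is still found where it stands alone.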
According to this scheme, performing text decomposition on the periodic report file removes interference, namely content on which information mining cannot be performed and punctuation marks. This constitutes a first corpus screening and yields a text segment set of finer granularity. Because the pre-constructed dictionary is built based on the word length and word frequency of seed words, identifying the content in the text segment set with the pre-constructed dictionary yields a keyword set that has information mining value. Classifying the keyword set with a classifier constitutes a second screening and yields a plurality of keyword categories. Because the trained classifier describes the correspondence between each keyword in the keyword set and the meaning it represents, classifying the keyword set with the classifier completes the qualitative characterization of each keyword in the periodic report file, so that the information mining result can be output on the basis of the keyword categories and information mining is completed. This not only improves the utilization rate of the information but also expands the application range and reuse rate of the periodic report file.
In addition, a resource pushing strategy is generated based on the information mining result, the resource pushing strategy is associated with the user ID and then packaged and sent to the target server, so that the target server recommends corresponding resources to the terminal logged in the user ID according to the resource pushing strategy, the resource pushing mode and pushing content of the target server are optimized by using the information mining result, and meanwhile the utilization rate of the information mining result is improved.
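The first stage of the scheme summarized above, text decomposition, can be sketched as follows: locate the sentence-breaking punctuation marks, split the report at those positions, and drop meaningless words and remaining punctuation. The punctuation set, the stop-word list, and the function name `decompose` are assumptions chosen for the example.

```python
import re
from typing import List

# Assumed sentence-breaking punctuation (Western and Chinese) plus newlines.
BREAK_PATTERN = re.compile(r"[.!?;。！？；\n]")
# Hypothetical list of meaningless words to discard.
STOP_WORDS = {"the", "a", "an", "of", "and", "to"}


def decompose(report: str) -> List[str]:
    """Split a report into cleaned text segments."""
    # Step 1: positions of every sentence-breaking punctuation mark.
    positions = [m.start() for m in BREAK_PATTERN.finditer(report)]
    # Step 2: cut the report at those positions into original text fragments.
    fragments, start = [], 0
    for pos in positions:
        fragments.append(report[start:pos])
        start = pos + 1
    fragments.append(report[start:])
    # Step 3: remove stop words and leftover punctuation; skip empty fragments.
    cleaned = []
    for fragment in fragments:
        words = [w for w in re.findall(r"\w+", fragment) if w.lower() not in STOP_WORDS]
        if words:
            cleaned.append(" ".join(words))
    return cleaned


text_segments = decompose("Visited the client. Closed a deal; team award won!")
```

Note that this simple pattern also treats the decimal point in a number such as 3.5 as a sentence break; a production splitter would need a rule for that case.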
Referring to fig. 5, fig. 5 is a block diagram illustrating the structure of an information mining apparatus according to an embodiment of the present application. The information mining apparatus in this embodiment includes units for performing the steps in the embodiments corresponding to fig. 1 to 4. Please refer to fig. 1 to fig. 4 and the related descriptions in the embodiments corresponding to fig. 1 to fig. 4. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 5, the information mining apparatus 50 includes: a decomposition unit 51, a first recognition unit 52, a second recognition unit 53, and an output unit 54, wherein:
The decomposition unit 51 is configured to perform text decomposition on the periodic report file to obtain a text segment set.
The first recognition unit 52 is configured to recognize the content in the text segment set by using a pre-constructed dictionary, so as to obtain a keyword set.
The second recognition unit 53 is configured to perform type recognition on each keyword in the keyword set through a trained classifier, and to match a corresponding type tag to each keyword.
The output unit 54 is configured to output an information mining result based on the type tag of each of the keywords.
As an embodiment of the present application, the information mining apparatus 50 further includes:
A first sending unit 55, configured to send a request for acquiring a periodic report file to a preset server; the preset server is used for receiving a periodic report file uploaded by a terminal logged in a user ID, and storing the periodic report file in association with the user ID.
And a receiving unit 56, configured to receive the periodic report file and the corresponding user ID returned by the preset server according to the request.
As an embodiment of the present application, the information mining apparatus 50 further includes:
A policy generating unit 57, configured to generate a resource pushing policy based on the information mining result.
And the second sending unit 58 is configured to associate the resource pushing policy with the user ID, and then package and send the resource pushing policy to the target server, so that the target server recommends corresponding resources to the terminal that has logged in the user ID according to the resource pushing policy.
As an embodiment of the present application, the information mining apparatus 50 further includes:
A dictionary construction unit 59, configured to construct a dictionary from a preset corpus sample by using a dictionary construction tool.
It should be understood that, in the block diagram of the information mining apparatus shown in fig. 5, each unit is configured to perform each step in the embodiments corresponding to fig. 1 to 4, and each step in the embodiments corresponding to fig. 1 to 4 has been explained in detail in the foregoing embodiments, and specific reference is made to fig. 1 to 4 and related descriptions in the embodiments corresponding to fig. 1 to 4, which are not repeated herein.
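The way units 51 to 54 cooperate can be pictured as a small pipeline in which each unit is an injected function. The class name `InformationMiningPipeline` and the toy stand-ins used below are hypothetical; in particular, the trained classifier is replaced by a fixed lookup table purely so that the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class InformationMiningPipeline:
    decompose: Callable[[str], List[str]]        # role of decomposition unit 51
    recognize: Callable[[List[str]], List[str]]  # role of first recognition unit 52
    classify: Callable[[str], str]               # role of second recognition unit 53

    def run(self, report: str) -> List[Tuple[str, str]]:
        """Role of output unit 54: return (keyword, type tag) pairs."""
        segments = self.decompose(report)
        keywords = self.recognize(segments)
        return [(kw, self.classify(kw)) for kw in keywords]


# Toy stand-ins; a real system would plug in the dictionary and trained classifier.
toy_dictionary = {"sales", "award"}
toy_tags = {"sales": "performance", "award": "honor"}

pipeline = InformationMiningPipeline(
    decompose=lambda r: [s.strip() for s in r.split(".") if s.strip()],
    recognize=lambda segs: [w for s in segs for w in s.split() if w in toy_dictionary],
    classify=lambda kw: toy_tags[kw],
)
result = pipeline.run("sales rose. got award.")
```

Passing the stages in as callables keeps the sketch independent of any particular dictionary or classifier implementation.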
Fig. 6 is a block diagram of a computer device according to another embodiment of the present application. As shown in fig. 6, the computer device 60 of this embodiment includes: a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61, for example a program of the information mining method. When executing the computer program 63, the processor 61 implements the steps of the respective embodiments of the information mining method described above, for example, S11 to S14 shown in fig. 1, or S21 to S14 and S21 to S32 shown in fig. 2 and 3, or S21 to S32 and S41 shown in fig. 4. Alternatively, when executing the computer program 63, the processor 61 performs the functions of the units in the embodiment corresponding to fig. 5, for example, the functions of the units 51 to 59 shown in fig. 5; refer to the related descriptions in the embodiment corresponding to fig. 5, which are not repeated here.
Illustratively, the computer program 63 may be partitioned into one or more units that are stored in the memory 62 and executed by the processor 61 to complete the present application. The one or more elements may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program 63 in the computer device 60. For example, the computer program 63 may be divided into a query unit, a display unit, a transmission unit and a reception unit, each unit functioning specifically as described above.
The computer device may include, but is not limited to, a processor 61 and a memory 62. It will be appreciated by those skilled in the art that fig. 6 is merely an example of a computer device 60 and is not intended to limit the computer device 60, which may include more or fewer components than shown, may combine certain components, or may use different components; for example, the computer device may also include an input-output device, a network access device, a bus, etc.
The processor 61 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 60. Further, the memory 62 may include both internal and external storage units of the computer device 60. The memory 62 is used for storing the computer program as well as other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. An information mining method, comprising:
Performing text decomposition on the periodic report file to obtain a text fragment set;
Identifying the content in the text fragment set by utilizing a pre-constructed dictionary to obtain a keyword set;
Performing type recognition on each keyword in the keyword set through a trained classifier, and matching a corresponding type label for each keyword; the type labels are used for distinguishing meanings among a plurality of keywords in the keyword set;
Outputting an information mining result based on the type label of each keyword; the information mining result is that the type label matched with the keyword is used as annotation content in the content of the original periodic report file, the position of the keyword is identified from the periodic report file, and the corresponding annotation content is displayed at the position of each keyword;
Before the step of performing text decomposition on the periodic report file to obtain the text fragment set, the method further comprises the following steps:
a request for acquiring a periodic report file is sent to a preset server; the preset server is used for receiving a periodic report file uploaded by a terminal logged in a user ID, and storing the periodic report file in association with the user ID;
receiving a periodic report file and a corresponding user ID returned by the preset server according to the request;
Before the step of identifying the content in the text segment set by utilizing the pre-constructed dictionary to obtain the keyword set, the method further comprises the following steps:
Inputting a corpus sample into a dictionary construction tool, obtaining, based on a recognition strategy configured in the dictionary construction tool, words whose word length is smaller than a word length threshold and whose occurrence frequency is equal to or larger than a word frequency threshold from the corpus sample to form an original dictionary, performing numerical conversion on the contents of the original dictionary, and removing words with a low usage rate to obtain the pre-constructed dictionary;
The outputting the information mining result based on the type label of each keyword comprises the following steps:
Performing label configuration on the content in the periodic report file based on the type label of each keyword by using the trained label configuration model to obtain a new periodic report file; the trained label configuration model is used for describing the corresponding relation between a target label and the type label, the target label is used for distinguishing a numerical label and a non-numerical label in the type label, the numerical label corresponds to keywords and is numerical content, and the non-numerical label corresponds to keywords and is non-numerical content; the label configuration model is constructed based on the BERT serialization labeling model to obtain an original model; before training the original model, performing regular matching on type labels obtained by identification in sample corpus by constructing a regular expression between the target labels and the type labels, so that the target labels corresponding to the numerical type labels in the type labels are numerical values, the target labels corresponding to the non-numerical type labels are keywords, and training the original model by utilizing data obtained by the regular matching to obtain a trained label configuration model;
and outputting the new periodic report file as the information mining result.
2. The method of claim 1, wherein the performing text decomposition on the periodic report file to obtain a text segment set includes:
Determining punctuation marks of the sentence breaking points in the periodical report file and position information corresponding to each punctuation mark;
Based on the position information corresponding to each punctuation mark, carrying out sentence breaking processing on the content of the periodic report file to obtain a plurality of original text fragments;
And eliminating the meaningless words and the sentence-breaking punctuation marks in each original text fragment to obtain a text fragment set.
3. The information mining method according to claim 1, wherein after the step of outputting the information mining result based on the type tag of each of the keywords, further comprising:
generating a resource pushing strategy based on the information mining result;
and after the resource pushing strategy is associated with the user ID, packaging and sending the resource pushing strategy to a target server, so that the target server recommends corresponding resources to the terminal which has logged in the user ID according to the resource pushing strategy.
4. The information mining method according to any one of claims 1 to 3, characterized by further comprising, after said step of outputting an information mining result based on a type tag of each of said keywords:
And uploading the information mining result to a blockchain node.
5. An information mining apparatus, comprising:
The decomposing unit is used for carrying out text decomposition on the periodic report file to obtain a text fragment set;
The first recognition unit is used for recognizing the content in the text fragment set by utilizing a pre-built dictionary to obtain a keyword set;
The second recognition unit is used for carrying out type recognition on each keyword in the keyword set through the trained classifier, and matching a corresponding type label for each keyword; the type labels are used for distinguishing meanings among a plurality of keywords in the keyword set;
the output unit is used for outputting an information mining result based on the type label of each keyword; the information mining result is that the type label matched with the keyword is used as annotation content in the content of the original periodic report file, the position of the keyword is identified from the periodic report file, and the corresponding annotation content is displayed at the position of each keyword;
The information mining apparatus is further configured to: before the step of carrying out text decomposition on the periodic report file to obtain a text fragment set, a request for acquiring the periodic report file is sent to a preset server; receiving a periodic report file and a corresponding user ID returned by the preset server according to the request; the preset server is used for receiving a periodic report file uploaded by a terminal logged in a user ID, and storing the periodic report file in association with the user ID;
The dictionary construction unit is used for inputting a corpus sample into a dictionary construction tool, obtaining words with word lengths smaller than a word length threshold value and occurrence frequencies equal to or larger than a word frequency threshold value from the corpus sample based on a recognition strategy configured in the dictionary construction tool to form an original dictionary, carrying out numerical value conversion and conversion on contents in the original dictionary, and removing words with low use rate to obtain the pre-constructed dictionary;
The outputting the information mining result based on the type label of each keyword comprises the following steps: performing label configuration on the content in the periodic report file based on the type label of each keyword by using the trained label configuration model to obtain a new periodic report file; the new periodic report file is output as the information mining result; the trained label configuration model is used for describing the corresponding relation between a target label and the type label, the target label is used for distinguishing a numerical label and a non-numerical label in the type label, the numerical label corresponds to keywords and is numerical content, and the non-numerical label corresponds to keywords and is non-numerical content; the label configuration model is constructed based on the BERT serialization labeling model to obtain an original model; before training the original model, performing regular matching on the type labels obtained by recognition in the sample corpus by constructing a regular expression between the target labels and the type labels, so that the target labels corresponding to the numerical type labels in the type labels are numerical values, the target labels corresponding to the non-numerical type labels are keywords, and training the original model by using data obtained by the regular matching to obtain the trained label configuration model.
6. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 4.
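The regular-matching step recited in claim 1, which separates numerical content from non-numerical content before the label configuration model is trained, can be illustrated with a short sketch. The regular expression, the label strings "numeric" and "non-numeric", and the function names are assumptions for this example; the claims do not disclose the actual expression, and the BERT sequence labeling model itself is not shown.

```python
import re
from typing import List, Tuple

# Assumed pattern: optional sign, digits with optional . or , groups, optional percent sign.
NUMERIC_PATTERN = re.compile(r"^[+-]?\d+(?:[.,]\d+)*%?$")


def target_label(keyword: str) -> str:
    """Assign a hypothetical target label by regular matching on the keyword text."""
    return "numeric" if NUMERIC_PATTERN.match(keyword.strip()) else "non-numeric"


def build_training_rows(samples: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Attach a target label to each (keyword, type tag) pair from the sample corpus."""
    return [(kw, tag, target_label(kw)) for kw, tag in samples]


rows = build_training_rows([
    ("95%", "index"),
    ("1,200", "performance"),
    ("renewal rate", "index"),
    ("Q3", "activity"),
])
```

Rows produced this way could serve as weakly labeled training data for a sequence labeling model, which is the role the claim assigns to the regular-matching output.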
CN202010797241.9A 2020-08-10 2020-08-10 Information mining method and computer equipment Active CN111930805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010797241.9A CN111930805B (en) 2020-08-10 2020-08-10 Information mining method and computer equipment

Publications (2)

Publication Number Publication Date
CN111930805A CN111930805A (en) 2020-11-13
CN111930805B true CN111930805B (en) 2024-08-13

Family

ID=73308041

Country Status (1)

Country Link
CN (1) CN111930805B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant