CN110717096A - Bill data extraction method and device, computer equipment and storage medium

Info

Publication number: CN110717096A
Application number: CN201910842800.0A
Authority: CN
Inventors: 王可鹏
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Shenzhen Ping An Medical Health Technology Service Co Ltd
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2020-01-21
Anticipated expiration: 2039-09-06
Also published as: CN110717096B

Abstract

The application relates to data processing technology and provides a document data extraction method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a user detection data set; clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result; correcting the initial clustering result according to a detection result corresponding to the user detection data set to obtain a target clustering result; determining a cluster to which a detection result of a known category belongs based on the target clustering result; determining a category corresponding to each detection result according to the known category and the cluster; and extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data. By adopting the method, the accuracy of document data extraction can be improved.

Description

Bill data extraction method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting document data, a computer device, and a storage medium.

Background

With the continuous development of science and technology, a user's detection result is now determined by combining multiple approaches; the product recommended to the user and the corresponding product recommendation information are determined according to the detection result, and corresponding document data are generated based on the user information and the product recommendation information for the user to consult and choose from. The product recommendation information is usually determined by a document manager according to the detection result. However, product recommendation information determined in this way is limited by the experience and qualifications of the document manager, so some product recommendation information matches the user information poorly. How to improve the match between product recommendation information and user information by drawing on the product recommendation information determined by experienced, highly qualified document managers is therefore a problem worth attention.

Currently, product recommendation information is generally extracted from the user detection data corresponding to a single detection result, corresponding document data is determined based on the extracted product recommendation information, and the document data is pushed to a document manager's terminal for the document manager's reference. However, this way of extracting product recommendation information and document data considers only a single dimension and extracts document data with low accuracy.

Disclosure of Invention

In view of the above, it is necessary to provide a document data extraction method, apparatus, computer device and storage medium capable of improving the document data extraction accuracy.

A document data extraction method, the method comprising:

acquiring a user detection data set;

clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result;

correcting the initial clustering result according to a detection result corresponding to the user detection data set to obtain a target clustering result;

determining a cluster to which a detection result of a known category belongs based on the target clustering result;

determining a category corresponding to each detection result according to the known category and the cluster;

and extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data.

In one embodiment, the clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result includes:

extracting feature data from each user detection data in the user detection data set;

determining a feature vector corresponding to the user detection data according to the feature data;

and clustering the user detection data set according to the feature vector to obtain an initial clustering result.

In one embodiment, the correcting the initial clustering result according to the detection result corresponding to the user detection data set to obtain the target clustering result includes:

determining a detection result corresponding to the user detection data set according to a detection result in each user detection data in the user detection data set;

traversing the detection result corresponding to the user detection data set;

and correcting the clustering result corresponding to the user detection data corresponding to the traversed detection result according to the initial clustering result to obtain a target clustering result.

In one embodiment, the extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data includes:

when the category corresponding to the detection result in the user detection data is a first category, determining a first detection time according to the user detection data;

determining a first time period and a second detection time according to the first detection time and a first preset time length;

determining a second time period according to the second detection time and the first preset time length;

extracting first document data in the first time period and second document data in the second time period from the user detection data;

and when the first document data and the second document data are consistent, determining the first document data or the second document data as the document data extracted from the user detection data;

when the category corresponding to the detection result in the user detection data is a second category, determining a detection time according to the user detection data;

determining a preset time period according to the detection time and a second preset time length;

and extracting the document data in the preset time period from the user detection data as the document data extracted from the user detection data.
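The two extraction branches above can be illustrated with a minimal Python sketch. The source does not state how the time periods are derived from the detection times, so the sketch assumes that each period starts at its detection time and lasts one preset time length, and that the second detection time is the first detection time plus the first preset time length. The function names, the `(date, item)` record layout and the rule that "consistent" means identical item lists are all hypothetical.

```python
from datetime import date, timedelta


def items_in_period(records, start, length):
    """Return the sorted items dated within [start, start + length)."""
    end = start + length
    return sorted(item for day, item in records if start <= day < end)


def extract_first_category(records, first_detection_time, first_length):
    # Assumption: the second detection time is one preset length after the first.
    second_detection_time = first_detection_time + first_length
    first_data = items_in_period(records, first_detection_time, first_length)
    second_data = items_in_period(records, second_detection_time, first_length)
    # Keep the data only when both periods agree (assumed meaning of "consistent").
    return first_data if first_data == second_data else None


def extract_second_category(records, detection_time, second_length):
    # Assumption: the preset period starts at the detection time.
    return items_in_period(records, detection_time, second_length)


# Illustrative records: (date, prescribed item).
records = [
    (date(2019, 1, 5), "drug A"),
    (date(2019, 1, 6), "drug B"),
    (date(2019, 2, 4), "drug A"),
    (date(2019, 2, 6), "drug B"),
]
first_result = extract_first_category(records, date(2019, 1, 1), timedelta(days=30))
second_result = extract_second_category(records, date(2019, 1, 1), timedelta(days=5))
```

Under these assumptions, the first branch compares two consecutive windows before accepting the data, while the second branch reads a single window.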

In one embodiment, after extracting document data from the user detection data according to a category corresponding to a detection result in the user detection data, the method further includes:

constructing a document database according to the extracted document data;

when a target detection result that is sent by a terminal and corresponds to a user identifier is received, querying the document database for candidate document data matching the target detection result;

acquiring user basic information corresponding to the user identifier, and screening the candidate document data according to the user basic information;

querying user characteristic data corresponding to the user identifier, and selecting target document data from the screened document data according to the user characteristic data;

and feeding back the target document data to the terminal.
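The three-stage lookup described above (match by detection result, screen by basic information, select by characteristic data) might be sketched as follows. The source does not define the screening or selection criteria, so the age-range filter, the `profile` vector and the nearest-profile rule below are purely hypothetical illustrations, as are all field and function names.

```python
import math


def recommend_document(database, detection_result, user_info, user_features):
    # Stage 1: candidates whose detection result matches the target.
    candidates = [d for d in database if d["detection_result"] == detection_result]
    # Stage 2: screen by basic user information (hypothetical age range).
    screened = [
        d for d in candidates
        if d["min_age"] <= user_info["age"] <= d["max_age"]
    ]
    if not screened:
        return None
    # Stage 3: pick the entry whose hypothetical profile vector is
    # closest to the user's characteristic data.
    return min(screened, key=lambda d: math.dist(d["profile"], user_features))


database = [
    {"id": 1, "detection_result": "X", "min_age": 0, "max_age": 17, "profile": (1.0, 12.0)},
    {"id": 2, "detection_result": "X", "min_age": 18, "max_age": 120, "profile": (1.0, 12.0)},
    {"id": 3, "detection_result": "X", "min_age": 18, "max_age": 120, "profile": (3.0, 4.0)},
    {"id": 4, "detection_result": "Y", "min_age": 18, "max_age": 120, "profile": (3.0, 5.0)},
]
target = recommend_document(database, "X", {"age": 40}, (3.0, 5.0))
```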

In one embodiment, the method further comprises:

acquiring target user detection data to be scored;

extracting a detection result from the target user detection data, and determining a category corresponding to the extracted detection result;

calling a document extraction script file corresponding to the determined category, and extracting document data to be scored from the target user detection data;

and inputting the document data to be scored into a trained document scoring model for prediction to obtain a document score.

A document data extraction apparatus, the apparatus comprising:

the acquisition module is used for acquiring a user detection data set;

the clustering module is used for clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result;

the correction module is used for correcting the initial clustering result according to the detection result corresponding to the user detection data set to obtain a target clustering result;

the determining module is used for determining the cluster to which the detection result of the known category belongs based on the target clustering result;

the determining module is further used for determining a category corresponding to each detection result according to the known category and the cluster;

and the extraction module is used for extracting the document data from the user detection data according to the category corresponding to the detection result in the user detection data.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the document data extraction method of the various embodiments described above when executing the computer program.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the document data extraction method described in the various embodiments above.

According to the document data extraction method and apparatus, the computer device and the storage medium, the user detection data set is clustered based on the feature vector corresponding to each user detection data to obtain an initial clustering result, and the initial clustering result is corrected according to the corresponding detection results to obtain a more accurate target clustering result. The correspondence between the known category and the clusters is determined based on this target clustering result and the detection result of the known category, and the category corresponding to each cluster in the target clustering result is determined according to that correspondence, thereby determining the category corresponding to each detection result. Because the cluster analysis is based on multi-dimensional feature vectors and the category corresponding to each detection result is determined dynamically from the corrected target clustering result, a more accurate category can be determined for each detection result, so that more accurate document data can be extracted from each user detection data according to the category corresponding to its detection result, which improves the accuracy of document data extraction.

Drawings

FIG. 1 is a diagram of an application scenario of a document data extraction method in one embodiment;

FIG. 2 is a schematic flow chart diagram of a document data extraction method in one embodiment;

FIG. 3 is a schematic flow chart diagram of a document data extraction method in another embodiment;

FIG. 4 is a schematic flow chart diagram of a document data extraction method in an exemplary embodiment;

FIG. 5 is a block diagram of a document data extraction apparatus in one embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The document data extraction method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 obtains a user detection data set, clusters the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result, corrects the initial clustering result according to the detection results corresponding to the user detection data set to obtain a target clustering result, determines the cluster to which a detection result of a known category belongs based on the target clustering result, determines the category corresponding to each detection result of the user detection data set according to the known category and the corresponding cluster, and extracts document data from each user detection data according to the category corresponding to the detection result in that user detection data. It can be understood that, when receiving target user detection data sent by the terminal 102, the server 104 determines the category corresponding to the detection result in the target user detection data according to the target clustering result, and extracts the corresponding document data from the target user detection data according to the determined category. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In one embodiment, as shown in fig. 2, a document data extraction method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:

S202, acquiring a user detection data set.

The user detection data set is a set consisting of a plurality of user detection data. User detection data is data corresponding to a user whose generation is triggered according to the user information and the corresponding detection result. Specifically, it may be data generated when a detection person or a detection institution determines a recommended product and the corresponding product recommendation information according to the user's detection result, and then triggers generation according to the detection result, the recommended product, the product recommendation information and the corresponding user information. The detection result is result data obtained through one or more detection approaches. The recommended product is a product that is determined, according to the detection result, to be recommended to the user. The product recommendation information is information related to a recommended product, such as a product value and a periodic product value. The user information is basic information of the user, such as age and sex. Taking a medical case sample as the user detection data, for example, the detection result is a diagnosis result, the recommended product is a medicine prescribed according to the diagnosis result, and the product recommendation information includes, for example, the quantity or price of each medicine.

Specifically, the server monitors for a preset trigger condition and, when the preset trigger condition is detected, acquires a user detection data set locally or from another device through network communication. The preset trigger condition is, for example, receiving a document data extraction instruction sent by the terminal, or detecting that the current time matches a preset trigger time. The other device may be a server or a server cluster for storing user detection data.

S204, clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result.

The feature vector is a vector constructed from feature data extracted from the user detection data. Clustering is an unsupervised classification technique: through cluster analysis, the plurality of user detection data in the user detection data set can be grouped into a plurality of clusters, each of which can be understood as a group or class. The initial clustering result is the result of grouping each user detection data in the user detection data set into these clusters through cluster analysis, and it specifies the cluster to which each user detection data in the user detection data set belongs.

Specifically, the server extracts user detection data from the user detection data set, determines a feature vector corresponding to each user detection data, and clusters the user detection data set according to the determined feature vectors through an unsupervised clustering algorithm to obtain an initial clustering result. It can be understood that the server may specifically obtain a feature vector set corresponding to the user detection data set according to the determined feature vector, perform clustering on the feature vector set through an unsupervised clustering algorithm to obtain an initial clustering result corresponding to the feature vector set, and determine an initial clustering result corresponding to the user detection data set according to the initial clustering result corresponding to the feature vector set.

In one embodiment, the server queries the pre-stored feature vectors according to each user detection data. It will be appreciated that a plurality of user detection data and a feature vector corresponding to each user detection data may be included in the user detection data set.

In one embodiment, the server matches each user detection data with a preset feature vector determination condition to determine a corresponding feature vector. The server can match each user detection data with a preset detection data list respectively so as to determine the feature vector corresponding to the user detection data according to the matching result. The preset detection data list comprises a plurality of user detection data and a feature vector corresponding to each user detection data.

In one embodiment, the server determines a feature vector corresponding to each user detection data according to feature data in the user detection data, and clusters a corresponding user detection data set according to the determined feature vector to obtain an initial clustering result.
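The source only requires "an unsupervised clustering algorithm" and does not name one. As a minimal sketch, the Python below swaps in a basic k-means loop with a deterministic initialisation; the choice of k-means, the function name `kmeans` and the example vectors are assumptions for illustration only.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Cluster numeric vectors with a basic k-means loop.

    Returns one cluster label per input vector.
    """
    # Deterministic initialisation: the first k distinct vectors.
    centroids = []
    for v in vectors:
        if v not in centroids:
            centroids.append(v)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct vectors")

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Illustrative feature vectors: (period, annual frequency, stability).
vectors = [(1, 12, 1), (1, 11, 1), (12, 1, 0), (12, 2, 0), (1, 12, 1)]
initial_labels = kmeans(vectors, k=2)
```

The returned labels play the role of the initial clustering result: one cluster per user detection data. In practice the features would likely need scaling before distances are compared, which this sketch omits.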

S206, correcting the initial clustering result according to the detection result corresponding to the user detection data set to obtain a target clustering result.

The target clustering result is obtained by correcting the initial clustering result corresponding to the user detection data set according to the detection results. In the target clustering result, the plurality of user detection data corresponding to the same detection result belong to the same cluster; that is, for each detection result, the target clustering result specifies the cluster to which the corresponding user detection data belong.

Specifically, the server determines the detection results corresponding to the user detection data set according to the detection result in each user detection data in the user detection data set. For each detection result corresponding to the user detection data set, the server corrects, through a voting algorithm, the cluster to which the user detection data corresponding to that detection result belong in the initial clustering result, so that the one or more user detection data corresponding to each detection result are grouped into the same cluster, thereby correcting the initial clustering result and obtaining a consistent target clustering result. It can be understood that, in the initial clustering result, a plurality of user detection data corresponding to the same detection result may be grouped into different clusters; correcting the initial clustering result moves such user detection data into the same cluster, yielding a target clustering result that is consistent for each detection result.

In one embodiment, step S206 includes: determining a detection result corresponding to the user detection data set according to a detection result in each user detection data in the user detection data set; traversing the detection result corresponding to the user detection data set; and correcting the clustering result corresponding to the user detection data corresponding to the traversed detection result according to the initial clustering result to obtain a target clustering result.

Specifically, the server extracts the detection result from each user detection data in the user detection data set, summarizes the extracted detection results by detection result, merges identical detection results, and determines the merged detection results as the detection results corresponding to the user detection data set. The detection results corresponding to the user detection data set can be understood as the detection results included in the user detection data set. The server traverses each detection result corresponding to the user detection data set and determines whether the user detection data corresponding to the traversed detection result belong to different clusters, that is, whether a plurality of user detection data corresponding to the traversed detection result were grouped into different clusters. For a detection result whose corresponding user detection data are all in the same cluster, the server keeps that cluster unchanged, that is, performs no operation on the user detection data corresponding to the detection result, and continues to the next detection result.

For a detection result whose corresponding user detection data are grouped into different clusters, the server counts, according to the initial clustering result, the number of user detection data corresponding to that detection result in each cluster, and votes on the cluster to which the detection result and/or its corresponding user detection data belong based on the counts. The server compares the counts for the clusters, assigns the detection result and/or its corresponding user detection data to the cluster with the largest count, and moves the user detection data that correspond to the detection result but were grouped into other clusters in the initial clustering result into the cluster with the largest count. After the server finishes traversing the detection results corresponding to the user detection data set, the corrected clustering result is taken as the target clustering result.

For example, assume that there are five user detection data A, B, C, D and E, and that the detection result in all five is X. If, in the initial clustering result, user detection data A, B and C are grouped into the first cluster and user detection data D and E are grouped into the second cluster, then, based on the voting algorithm, the five user detection data corresponding to detection result X are voted into the first cluster; that is, user detection data D and E are reassigned to the first cluster, and all five user detection data belong to the first cluster in the resulting target clustering result.
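The voting correction can be sketched in a few lines of Python using the A to E example. The function name and dictionary layout are hypothetical, and the source does not say how ties between clusters are resolved; this sketch simply keeps the cluster encountered first among those tied.

```python
from collections import Counter, defaultdict


def correct_clusters(detection_results, initial_clusters):
    """Move all data sharing a detection result into its majority cluster.

    detection_results: dict data id -> detection result
    initial_clusters:  dict data id -> cluster label
    """
    # Group data ids by detection result.
    ids_by_result = defaultdict(list)
    for data_id, result in detection_results.items():
        ids_by_result[result].append(data_id)

    target_clusters = dict(initial_clusters)
    for result, data_ids in ids_by_result.items():
        votes = Counter(initial_clusters[i] for i in data_ids)
        if len(votes) == 1:
            continue  # already consistent, leave unchanged
        # Assumption: ties go to the cluster seen first.
        winner, _ = votes.most_common(1)[0]
        for i in data_ids:
            target_clusters[i] = winner
    return target_clusters


detection_results = {"A": "X", "B": "X", "C": "X", "D": "X", "E": "X"}
initial_clusters = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
target_clusters = correct_clusters(detection_results, initial_clusters)
```

Here cluster 1 receives three votes and cluster 2 receives two, so D and E are reassigned to cluster 1.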

In the above embodiment, the initial clustering result is corrected by traversing each detection result and correcting the clustering result corresponding to the corresponding user detection data based on the traversed detection result, so that a final consistent target clustering result can be obtained, and the accuracy of the clustering result can be improved.

S208, determining the cluster to which the detection result of the known category belongs based on the target clustering result.

A cluster is a group in the clustering result and may also be understood as a class. The target clustering result and the initial clustering result include the same plurality of clusters. The category is the category to which a detection result belongs or corresponds. Based on the characteristics of the detection results, a plurality of categories, such as a first category and a second category, are pre-configured for the detection results that may occur; that is, each detection result is classified into a corresponding category based on its characteristics. Taking a diagnosis result as the detection result, for example, the categories are divided into a first category and a second category according to the diagnosis period, the drug recommendation period and the like corresponding to the diagnosis result, where the first category is chronic diseases and the second category is non-chronic diseases.

Specifically, the server determines a detection result of a known category from the detection results corresponding to the user detection data set, and matches the detection result of the known category against the target clustering result to determine the cluster to which it belongs in the target clustering result.

In one embodiment, the server matches the one or more user detection data corresponding to the detection result of the known category against the target clustering result to determine the cluster to which the detection result of the known category belongs.

S210, determining a category corresponding to each detection result according to the known category and the cluster.

Specifically, the server determines the known category as the category corresponding to the cluster to which the corresponding detection result belongs, and then as the category corresponding to each detection result in that cluster. The server then determines the category corresponding to every cluster in the target clustering result according to the plurality of pre-configured categories, the clusters included in the target clustering result, and the clusters whose categories have already been determined, and takes the category corresponding to each cluster as the category corresponding to each detection result in that cluster, thereby determining the category corresponding to each detection result.

In one embodiment, the categories pre-configured for the detection results include a first category and a second category. The target clustering result is a two-class clustering result that includes a first cluster and a second cluster. After the known category is determined, in the above manner, as the category of the cluster to which the corresponding detection result belongs, the other pre-configured category is determined as the category of the other cluster in the target clustering result. For example, if the known category is the first category and the detection result of the known category belongs to the first cluster, the first category is determined as the category corresponding to the first cluster and the second category as the category corresponding to the second cluster. The server determines the category of the detection result of the known category, and of the detection results belonging to the same cluster in the target clustering result, as the first category, and determines the category of the other detection results in the target clustering result as the second category.

It can be understood that, when the target clustering result has more clusters, for example three or more, the server may determine the category corresponding to each cluster in the above manner based on a plurality of detection results of known categories.
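A minimal Python sketch of this label propagation follows, assuming the target clustering result is available as a mapping from detection result to cluster. The function name, the category labels and the disease names are illustrative assumptions; with three or more clusters the sketch only labels the clusters for which a known category is supplied, plus a single leftover pair if one remains.

```python
def categorize_results(result_to_cluster, known_categories, all_categories):
    """Assign a category to every detection result from a few labelled ones.

    result_to_cluster: dict detection result -> cluster (target clustering)
    known_categories:  dict detection result -> known category
    all_categories:    the pre-configured categories
    """
    # Step 1: each labelled detection result labels its whole cluster.
    cluster_to_category = {}
    for result, category in known_categories.items():
        cluster = result_to_cluster[result]
        previous = cluster_to_category.setdefault(cluster, category)
        if previous != category:
            raise ValueError(f"conflicting categories for cluster {cluster!r}")

    # Step 2: if exactly one cluster and one category remain, pair them
    # (the two-cluster case).
    unlabeled = [
        c for c in dict.fromkeys(result_to_cluster.values())
        if c not in cluster_to_category
    ]
    unused = [c for c in all_categories if c not in cluster_to_category.values()]
    if len(unlabeled) == 1 and len(unused) == 1:
        cluster_to_category[unlabeled[0]] = unused[0]

    # Step 3: every detection result inherits its cluster's category.
    return {
        result: cluster_to_category.get(cluster)
        for result, cluster in result_to_cluster.items()
    }


# Illustrative data: one detection result of known category.
result_to_cluster = {"hypertension": 1, "diabetes": 1, "influenza": 2}
categories = categorize_results(
    result_to_cluster,
    known_categories={"hypertension": "first"},
    all_categories=["first", "second"],
)
```

The returned mapping from detection result to category can then serve as a lookup structure when document data is extracted.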

In one embodiment, after determining the category corresponding to each detection result corresponding to the user detection data set, the server constructs a detection result category list or a detection result category table according to the plurality of detection results and the category corresponding to each detection result. The list or table includes the plurality of detection results and their corresponding categories, so that when document data is subsequently extracted from user detection data, the category corresponding to the detection result in the user detection data can be determined quickly from the list or table, and the corresponding document data extracted according to that category.

S212, extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data.

Specifically, the server queries a pre-configured document extraction mode according to the category corresponding to the detection result in each piece of user detection data, and extracts document data from the corresponding piece of user detection data according to the queried document extraction mode.

In one embodiment, the server queries a pre-configured document extraction script file according to a category corresponding to a detection result in each piece of user detection data, runs the queried document extraction script file, and extracts document data from corresponding user detection data respectively by means of the run document extraction script file. Corresponding document extraction logic is preconfigured in the document extraction script file, so that document data in user detection data can be extracted by the running document extraction script file based on the preconfigured document extraction logic.
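The category-based dispatch can be sketched in Python as a lookup table from category to extraction routine. The source speaks of pre-configured script files; the sketch substitutes plain functions for them, and every name, field and piece of extraction logic below is a hypothetical placeholder.

```python
def run_first_category_script(user_data):
    """Placeholder for the extraction logic configured for the first category."""
    return {"category": "first", "documents": user_data.get("documents", [])}


def run_second_category_script(user_data):
    """Placeholder for the extraction logic configured for the second category."""
    return {"category": "second", "documents": user_data.get("documents", [])[:1]}


# Hypothetical registry standing in for the pre-configured script files.
EXTRACTION_SCRIPTS = {
    "first": run_first_category_script,
    "second": run_second_category_script,
}


def extract_document_data(user_data, result_categories):
    """Look up the category of the detection result, then run its script."""
    category = result_categories[user_data["detection_result"]]
    return EXTRACTION_SCRIPTS[category](user_data)


result_categories = {"X": "first", "Y": "second"}
user_data = {"detection_result": "Y", "documents": ["doc 1", "doc 2"]}
extracted = extract_document_data(user_data, result_categories)
```

Keeping the extraction logic behind a per-category table means a new category only needs a new entry, without changes to the dispatch function.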

In one embodiment, after extracting the document data from each piece of user detection data in the user detection data set, the server constructs a document database according to the extracted document data, so that when document data is recommended, more accurate document data can be determined based on the document database for recommendation.

According to the above document data extraction method, the user detection data set is clustered based on the feature vector corresponding to each user detection data to obtain an initial clustering result, and the initial clustering result is corrected according to the corresponding detection results to obtain a more accurate target clustering result. The correspondence between the known category and the clusters is determined based on the target clustering result and the detection result of the known category, and the category corresponding to each cluster in the target clustering result is determined from this correspondence, thereby determining the category corresponding to each detection result. Because the cluster analysis is based on multi-dimensional feature vectors and the category of each detection result is determined dynamically from the corrected target clustering result, a more accurate category can be determined for each detection result. More accurate document data can therefore be extracted from each user detection data according to the category of its detection result, which improves the accuracy of document data extraction.

In one embodiment, step S204 includes: extracting feature data from each user detection data in the user detection data set; determining a feature vector corresponding to the user detection data according to the feature data; and clustering the user detection data set according to the characteristic vector to obtain an initial clustering result.

The feature data includes a product recommendation period, an annual product recommendation frequency, a periodic product value stability, and the like. Taking a case sample as an example of user detection data, the product recommendation period is the treatment period, the annual product recommendation frequency is the annual frequency of visits and medicine collection, and the periodic product value stability is the stability of the cost per period. The product recommendation period may be an average period calculated by year, quarter, or month; for example, if a product is recommended 3 times in a quarter, the period is one month. Each feature data extracted from the same user detection data corresponds to one vector element of the feature vector; that is, each extracted feature data is used directly as a vector element, or a vector element is determined from it. For example, assume the feature data extracted from a user detection data is: product recommendation period 1, annual product recommendation frequency 12, and periodic product value stability "stable", where the vector element corresponding to "stable" is 1. The resulting feature vector is (1, 12, 1). This representation is only an example and does not limit the form of the feature vector.
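The example vector (1, 12, 1) can be reproduced with a short sketch. The dictionary keys `period`, `annual_frequency`, and `stability` and the function name are assumptions made for illustration; the application does not prescribe a data layout.

```python
from typing import Dict, Tuple, Union

FeatureValue = Union[int, float, str]


def build_feature_vector(features: Dict[str, FeatureValue]) -> Tuple[float, float, float]:
    """Map extracted feature data to a 3-element feature vector.

    Numeric features are used directly as vector elements; the stability
    label is encoded as 1 for "stable" and 0 otherwise.
    """
    stability_flag = 1.0 if features["stability"] == "stable" else 0.0
    return (
        float(features["period"]),
        float(features["annual_frequency"]),
        stability_flag,
    )
```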

Specifically, the server takes each user detection data from the user detection data set and extracts a plurality of feature data from it. The server determines the feature vector corresponding to each user detection data according to the feature data extracted from that user detection data. The server then clusters the user detection data set according to the feature vector corresponding to each user detection data to obtain an initial clustering result.

In one embodiment, the server clusters the determined feature vectors with an unsupervised clustering algorithm to obtain a binary initial clustering result corresponding to the user detection data set. The binary initial clustering result includes two clusters.

In one embodiment, each vector element in the feature vector corresponds to one feature data. And the server determines corresponding vector elements according to the feature data extracted from the user detection data, so as to obtain the feature vector corresponding to the user detection data.

In one embodiment, the step in which the server determines the vector element corresponding to the periodic product values includes: the server extracts each periodic product value from the user detection data, analyzes the extracted values to determine their stability, and determines the corresponding vector element according to the stability. For example, if the periodic product values are unstable, the corresponding vector element is 0; otherwise, it is 1. The stability can be determined by calculating the variance or standard deviation of the periodic product values, or by checking whether the increase or decrease between two adjacent periodic product values exceeds a preset threshold.
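Both stability tests mentioned above can be sketched as follows. The function names and the thresholds `max_std` and `max_change` are hypothetical, and treating a single value as stable is an assumption not stated in the application.

```python
from statistics import pstdev
from typing import Sequence


def stability_by_std(values: Sequence[float], max_std: float) -> int:
    """Return 1 if the standard deviation of the values is within max_std, else 0."""
    if len(values) < 2:
        return 1  # assumption: too few values to show instability
    return 1 if pstdev(values) <= max_std else 0


def stability_by_change(values: Sequence[float], max_change: float) -> int:
    """Return 0 if any change between adjacent values exceeds max_change, else 1."""
    for prev, curr in zip(values, values[1:]):
        if abs(curr - prev) > max_change:
            return 0
    return 1
```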

In one embodiment, the server collects the feature vectors corresponding to the user detection data in the user detection data set and selects two cluster centers from these feature vectors, each cluster center corresponding to one cluster. The server calculates the distance between each feature vector and each of the two selected cluster centers, and assigns each feature vector to the cluster whose center is closer. The server then recalculates the center of each of the two clusters, calculates the distance between each feature vector and the two recalculated centers, and reassigns each feature vector accordingly. The iteration stops when the two cluster centers no longer change, yielding the binary initial clustering result. The binary initial clustering result specifies the cluster to which each feature vector belongs, from which the initial clustering result of the corresponding user detection data, and hence of the user detection data set, can be determined. It can be understood that, during clustering, the server dynamically determines and adjusts the cluster to which each feature vector, and/or the user detection data corresponding to it, belongs.
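The iterative procedure described above corresponds to k-means clustering with two clusters. The following is a minimal sketch under stated assumptions: Euclidean distance is used, and the initial centers are the first feature vector and the vector farthest from it. The application does not specify either choice, and the name `two_means` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _mean(vectors: List[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def two_means(vectors: List[Vector], max_iter: int = 100) -> List[int]:
    """Cluster vectors into two clusters; return a cluster index (0 or 1) per vector."""
    if len(vectors) < 2:
        raise ValueError("need at least two feature vectors")
    # Assumed initialization: first vector and the vector farthest from it.
    first = tuple(vectors[0])
    farthest = tuple(max(vectors, key=lambda v: math.dist(v, first)))
    centers = [first, farthest]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each vector to the nearer of the two centers.
        labels = [
            0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            for v in vectors
        ]
        # Recompute each center; an empty cluster keeps its previous center.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            new_centers.append(_mean(members) if members else centers[k])
        if new_centers == centers:
            break  # centers no longer change, so iteration stops
        centers = new_centers
    return labels
```

In practice the feature elements may need scaling before clustering, since unscaled features with larger ranges dominate the Euclidean distance.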

In the above embodiment, the corresponding feature vector is determined according to the feature data extracted from the user detection data, and the corresponding user detection data set is clustered according to the determined feature vector, so that a relatively accurate initial clustering result can be obtained.

In one embodiment, step S212 includes: when the category corresponding to the detection result in the user detection data is a first category, determining a first detection time according to the user detection data; determining a first time period and a second detection time according to the first detection time and a first preset time length; determining a second time period according to the second detection time and the first preset time length; extracting first document data in the first time period and second document data in the second time period from the user detection data; and when the first document data and the second document data are consistent, determining the first document data or the second document data as the document data extracted from the user detection data.

The first detection time refers to the earliest detection time determined according to the user detection data, and may specifically be the earliest detection time within a calendar year, that is, a complete year of the common calendar. The first preset time length is a time length set in advance, such as 3 months. A time period is a time interval determined by a specified start time and a time length. Taking a case sample as an example of user detection data, the first detection time is the first visit time determined according to the case sample, and the first category may specifically be chronic disease.

Specifically, after determining the category corresponding to the detection result in each user detection data in the user detection data set, the server takes that category as the category of the user detection data itself, and screens out from the user detection data set the user detection data whose category is the first category. The server determines the first detection time of each screened user detection data and queries the pre-configured first preset time length. For each screened user detection data, the server determines a first time period that starts at the first detection time and whose length equals the first preset time length, takes the end time of the first time period as the second detection time, and determines a second time period that starts at the second detection time and whose length also equals the first preset time length. The server extracts the first document data in the first time period and the second document data in the second time period from the corresponding user detection data, and compares the two. When the first document data and the second document data are consistent, the document data is stable across the two time periods, and the server determines the consistent first document data or second document data as the document data extracted from the corresponding user detection data. When the first document data and the second document data are inconsistent, the server discards the extracted first document data and/or second document data. The server performs this operation for each user detection data of the first category to obtain the extracted document data.
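The two-period comparison for the first category may be sketched as follows. This is an illustration under assumptions: each user detection data is modeled as a list of `(detection time, recommended products)` pairs, document data in a period is modeled as the set of products recommended in that period, and the preset time length is given as a `timedelta`. None of these representations is prescribed by the application.

```python
from datetime import date, timedelta
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical model: (detection time, products recommended at that time).
Visit = Tuple[date, FrozenSet[str]]


def documents_in(visits: List[Visit], start: date, length: timedelta) -> FrozenSet[str]:
    """Collect the products recommended in the half-open period [start, start + length)."""
    end = start + length
    products = set()
    for when, items in visits:
        if start <= when < end:
            products |= items
    return frozenset(products)


def extract_first_category(visits: List[Visit], length: timedelta) -> Optional[FrozenSet[str]]:
    """Return the document data if it is identical in the first two periods, else None."""
    if not visits:
        return None
    first_time = min(when for when, _ in visits)  # first detection time
    second_time = first_time + length             # end of the first period
    first_docs = documents_in(visits, first_time, length)
    second_docs = documents_in(visits, second_time, length)
    # Assumption: an empty period never counts as consistent document data.
    if first_docs and first_docs == second_docs:
        return first_docs
    return None
```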

In one embodiment, the server determines a first time period determined according to the user detection data as a first product recommendation period, determines a second time period as a second product recommendation period, and extracts a product recommended in each recommendation period and corresponding product recommendation information from the user detection data according to the above manner, so as to obtain document data in each recommendation period.

In one embodiment, the server determines a first time period with a time length equal to a first preset time length by taking the first detection time as a start time, determines a second time period with a time length equal to the first preset time length by taking an end time of the first time period as a start time of the second time period, and so on, and determines a plurality of time periods corresponding to each user detection data. And the server respectively extracts corresponding bill data in each time period from the user detection data, respectively compares the bill data in two adjacent time periods, and determines the consistent bill data as the bill data extracted from the user detection data according to the comparison result. Therefore, a plurality of corresponding bill data can be extracted from each user detection data. It is to be understood that the starting time of the second time period may also be the earliest detection time after the first time period.
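The multi-period variant can be sketched in the same spirit. The sketch assumes the same hypothetical model as before (a list of `(detection time, recommended products)` pairs, with document data in a period taken to be the set of products recommended in it) and uses consecutive periods of equal length starting at the first detection time.

```python
from datetime import date, timedelta
from typing import FrozenSet, List, Tuple

# Hypothetical model: (detection time, products recommended at that time).
Visit = Tuple[date, FrozenSet[str]]


def stable_documents(visits: List[Visit], length: timedelta) -> List[FrozenSet[str]]:
    """Return document data that is identical in two adjacent time periods."""
    if length <= timedelta(0):
        raise ValueError("length must be positive")
    if not visits:
        return []
    start = min(when for when, _ in visits)
    last = max(when for when, _ in visits)
    # Split the timeline into consecutive periods of equal length.
    periods: List[FrozenSet[str]] = []
    while start <= last:
        end = start + length
        docs = frozenset().union(
            *(items for when, items in visits if start <= when < end)
        )
        periods.append(docs)
        start = end
    # Keep each non-empty document data that repeats in the next period.
    kept: List[FrozenSet[str]] = []
    for prev, curr in zip(periods, periods[1:]):
        if prev and prev == curr and prev not in kept:
            kept.append(prev)
    return kept
```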

In the above embodiment, for user detection data of the first category, document data is extracted by comparing the document data in adjacent time periods. This takes the characteristics of first-category user detection data fully into account and can improve the accuracy of document data extraction.

In one embodiment, step S212 includes: when the category corresponding to the detection result in the user detection data is a second category, determining detection time according to the user detection data; determining a preset time period according to the detection time and a second preset time length; and extracting the bill data in a preset time period from the user detection data to be used as the bill data extracted from the user detection data.

Specifically, after determining the category corresponding to each user detection data in the user detection data set, the server screens out from the user detection data set the user detection data whose category is the second category, and determines the detection time of each screened user detection data. For each screened user detection data, the server determines a preset time period that starts at the corresponding detection time and whose length equals the pre-configured second preset time length, and extracts the document data within that preset time period from the corresponding user detection data as the document data extracted from the user detection data.
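For the second category, a single window suffices. The sketch below assumes a hypothetical model in which user detection data is a list of `(detection time, recommended products)` pairs and the second preset time length is a `timedelta`; the detection time is passed in explicitly because the application does not fix how it is chosen.

```python
from datetime import date, timedelta
from typing import FrozenSet, List, Tuple

# Hypothetical model: (detection time, products recommended at that time).
Visit = Tuple[date, FrozenSet[str]]


def extract_second_category(
    visits: List[Visit], detection_time: date, length: timedelta
) -> FrozenSet[str]:
    """Collect the products recommended in [detection_time, detection_time + length)."""
    end = detection_time + length
    products = set()
    for when, items in visits:
        if detection_time <= when < end:
            products |= items
    return frozenset(products)
```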

In one embodiment, for each piece of screened user detection data, the server determines one or more corresponding detection times according to the user detection data, determines one or more corresponding preset time periods respectively with each detection time as a starting time and a second preset time duration as a time length, and extracts the bill data in each preset time period from the user detection data as one or more bill data extracted from the user detection data. In one embodiment, the second category may specifically be non-chronic diseases.

In the above embodiment, for user detection data of the second category, the document data within a single preset time period is determined as the document data extracted from the user detection data, based on the characteristics of such data. This improves the efficiency of document data extraction while preserving its accuracy.

In an embodiment, after step S212, the method for extracting document data further includes: constructing a document database according to the extracted document data; when a target detection result which is sent by a terminal and corresponds to a user identification is received, candidate bill data matched with the target detection result is inquired from a bill database; acquiring user basic information corresponding to the user identifier, and screening data from the candidate document data according to the user basic information; inquiring user characteristic data corresponding to the user identification, and selecting target document data from the screened document data according to the user characteristic data; and feeding back the target document data to the terminal.

The document database is a set composed of a plurality of document data, and can be constructed from the document data extracted from each user detection data in the user detection data set. The user basic information is basic information about the user, such as height, weight, age, and sex. The user characteristic data is data that characterizes the user and is obtained by analyzing user data, for example, that the user expects recommended products with a low product value, or expects a high degree of matching between the recommended product and the detection result, so that document data or products can be recommended according to the user's own situation. Taking a prescription as an example of document data, a low product value means a low treatment cost, and a high degree of matching between the recommended product and the detection result means a good treatment effect. It will be appreciated that the user characteristic data may be user portrait data obtained by profiling the user based on the user data.

Specifically, after extracting document data from each user detection data in the user detection data set, the server constructs a document database from the extracted document data and the detection result corresponding to each document data. When receiving a target detection result that is sent by the terminal and corresponds to a user identifier, the server matches the received target detection result against the document database to query candidate document data matching the target detection result. The server obtains the corresponding user basic information locally or from another device according to the user identifier, and screens out from the candidate document data the document data matching the user basic information. The server queries the pre-configured user characteristic data according to the user identifier, selects target document data from the screened document data according to the queried user characteristic data, and feeds the selected target document data back to the terminal. The terminal displays the received target document data to the operator, so that the operator can refer to it when determining the document data corresponding to the target detection result.
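The three-stage selection (match by detection result, screen by user basic information, select by user characteristic data) may be sketched as follows. The `Document` fields, the use of age as the only basic information, and the boolean preference `prefer_low_cost` are simplifying assumptions for illustration and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical document database entry."""
    detection_result: str
    min_age: int
    max_age: int
    period_cost: float  # periodic product value
    grade: str          # evaluation level, "A" being the best


def recommend(
    database: List[Document], target_result: str, age: int, prefer_low_cost: bool
) -> Optional[Document]:
    """Select target document data for a detection result and a user."""
    # 1. Candidate document data matching the target detection result.
    candidates = [d for d in database if d.detection_result == target_result]
    # 2. Screen by user basic information (here: age only).
    screened = [d for d in candidates if d.min_age <= age <= d.max_age]
    if not screened:
        return None
    # 3. Select by user characteristic data (here: cost versus evaluation level).
    if prefer_low_cost:
        return min(screened, key=lambda d: d.period_cost)
    return min(screened, key=lambda d: d.grade)  # "A" sorts before "B" and "C"
```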

In one embodiment, the server obtains user data from a local or other device according to the user identifier, and analyzes the user data to obtain corresponding user characteristic data. The user characteristic data may be characteristic data which is obtained by the server in advance according to the acquired user data and is stored locally.

In one embodiment, when receiving target user detection data that is sent by a terminal and includes a target detection result, the server determines the target detection result, the user identifier, the user basic information, and the document data according to the target user detection data. The server queries matching candidate document data from the document database according to the target detection result, screens the queried candidate document data based on the user basic information, queries pre-stored user characteristic data according to the user identifier, selects target document data from the screened document data according to the user characteristic data, and feeds the selected target document data back to the terminal. The server may also audit the document data determined from the target user detection data against the selected target document data, and feed the audit result and/or the target document data back to the terminal.

In one embodiment, the server scores each document data in the document database by means of a trained document scoring model, and stores each document score in the document database in association with the corresponding document data, so that target document data matching a target detection result can be queried from the document database according to the document score. After selecting a plurality of target document data from the document database according to the target detection result, the user basic information, and the user characteristic data, the server may select the one with the highest document score and feed it back to the terminal.

In one embodiment, the target document data fed back to the terminal by the server includes a product identifier list composed of one or more product identifiers, and may further include product description data corresponding to each product identifier, as well as a product value, a periodic product value, an evaluation result, and the like. Taking a prescription as an example of target document data, the product identifier is a medicine identifier, the product description data is the medicine usage, the product value is the medicine price, the periodic product value is the treatment cost per period, and the evaluation result is the treatment effect.

In one embodiment, after extracting document data from each user detection data in the user detection data set, the server determines an evaluation index for each extracted document data based on the corresponding user detection data according to a preset index extraction mode, and uses the determined evaluation index as an evaluation tag of that document data, so that higher-quality target document data can be screened according to the evaluation tag when target document data is fed back for a target detection result. The preset index extraction mode is to extract, based on the user detection data, detection institution change data corresponding to the document data, determine a detection institution change mode according to the detection institution change data, and determine the evaluation index according to the detection institution change mode. For example, the evaluation index may be a treatment effect index, the detection institution change data may be referral data, and the detection institution change mode may be a referral mode.

For example, if the detection institution change mode is an upward change (an upward referral), the evaluation index of the document data is determined to be level C; if it is a lateral change, the evaluation index is determined to be level B; and if it is a downward change, the evaluation index is determined to be level A.
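The mapping in this example is a simple lookup. The enumeration and function names below are hypothetical; only the three change modes (up, lateral or "flat", down) and the three levels come from the example above.

```python
from enum import Enum


class ChangeMode(Enum):
    """Change mode between detection institutions (e.g. referral mode)."""
    UP = "up"
    LATERAL = "lateral"
    DOWN = "down"


# Evaluation level per change mode, as given in the example.
LEVEL_BY_MODE = {
    ChangeMode.UP: "C",
    ChangeMode.LATERAL: "B",
    ChangeMode.DOWN: "A",
}


def evaluation_level(mode: ChangeMode) -> str:
    """Return the evaluation level for a change mode."""
    return LEVEL_BY_MODE[mode]
```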

In one embodiment, for detection results of the first category, the server may determine, based on the user detection data, whether the document data remains unchanged for a preset time period, and if so, determine that the corresponding evaluation index is level A. For detection results of the second category, the server may determine the evaluation index of the corresponding document data based on the repeated detection rate over a preset time period. For the product value index, the server can sort the plurality of document data corresponding to the same detection result by periodic product value, based on the document database. For example, the first category may be chronic disease and the second category may be non-chronic disease.

In the embodiment, the corresponding user basic information and the corresponding user characteristic data are obtained according to the user identification corresponding to the target detection result, the target document data matched with the target detection result, the user basic information and the user characteristic data are inquired from the established document database and serve as the target document data recommended to the terminal, and the accuracy and the efficiency of the recommended document data can be improved.

In one embodiment, the document data extraction method further includes: acquiring target user detection data to be scored; extracting a detection result from the target user detection data, and determining the category corresponding to the extracted detection result; calling a document extraction script file corresponding to the determined category, and extracting document data to be scored from the target user detection data; and inputting the document data to be scored into the trained document scoring model for prediction to obtain document scoring.

The document extraction script file is a script program or function used for extracting document data from user detection data. The document scoring model is obtained by model training based on the training sample set and can be used for scoring document data.

Specifically, the server monitors for a specified trigger condition. When the specified trigger condition is detected, the server acquires the target user detection data to be scored and extracts a detection result from it. The server matches the extracted detection result against the detection result category table or detection result category list constructed from the user detection data set, so as to determine the category corresponding to the extracted detection result. The table or list is constructed and stored in advance from the detection results corresponding to the user detection data set and the category corresponding to each detection result. The server queries the pre-configured document extraction script file according to the determined category, and calls the queried script file to extract the document data to be scored from the target user detection data. The server then inputs the document data to be scored into the trained document scoring model for prediction to obtain the corresponding document score.

The appointed triggering condition refers to a condition which is appointed in advance and used for triggering document scoring operation, for example, a document scoring request sent by a terminal is received, or the current time is detected to be consistent with the appointed triggering time, or newly added user detection data is detected, and the newly added user detection data is determined as target user detection data to be scored.

In one embodiment, after the document data is extracted from each piece of user detection data in the user detection data set by the server, model training is performed according to the extracted document data to obtain a trained document scoring model.

In the embodiment, based on the plurality of detection results of the determined corresponding categories, the categories corresponding to the detection results in the target user detection data can be quickly determined, so that the document extraction script files corresponding to the categories are called to quickly extract document data to be scored, and the extraction efficiency of the document data can be improved. The document data extracted is scored by means of the document scoring model, and the scoring accuracy and efficiency can be improved.

As shown in FIG. 3, in one embodiment, a document data extraction method is provided, which specifically includes the following steps:

S302, obtaining a user detection data set.

S304, extracting feature data from each user detection data in the user detection data set.

S306, determining a feature vector corresponding to the user detection data according to the feature data.

S308, clustering the user detection data set according to the feature vector to obtain an initial clustering result.

S310, determining the detection results corresponding to the user detection data set according to the detection result in each user detection data in the user detection data set.

S312, traversing the detection results corresponding to the user detection data set.

S314, correcting, according to the initial clustering result, the clustering result of the user detection data corresponding to each traversed detection result, to obtain a target clustering result.

S316, determining the cluster to which the detection result of the known category belongs based on the target clustering result.

S318, determining the category corresponding to each detection result according to the known category and the cluster.

S320, when the category corresponding to the detection result in the user detection data is the first category, determining a first detection time according to the user detection data.

S322, determining a first time period and a second detection time according to the first detection time and a first preset time length.

S324, determining a second time period according to the second detection time and the first preset time length.

S326, extracting first document data in the first time period and second document data in the second time period from the user detection data.

S328, when the first document data and the second document data are consistent, determining the first document data or the second document data as the document data extracted from the user detection data.

S330, when the category corresponding to the detection result in the user detection data is the second category, determining a detection time according to the user detection data.

S332, determining a preset time period according to the detection time and a second preset time length.

S334, extracting the document data in the preset time period from the user detection data as the document data extracted from the user detection data.

As shown in FIG. 4, in a specific embodiment, a document data extraction method is provided. A case sample set composed of a plurality of case samples is taken as an example of the user detection data set, and each case sample is characterized by a combination of a user identifier (such as a, b, c, …) and a diagnosis result (such as A, B, C, …), for example a-A. Assume that diagnosis result B is a detection result of a known category, for example diabetes, whose category is chronic disease. The category of a diagnosis result can be understood as a disease type, and the pre-configured categories include chronic disease and non-chronic disease. The processing flow for extracting prescription data, also called document data, from each case sample by combining clustering with a voting algorithm is as follows:

Assume that the acquired case sample set includes case samples a-A, b-A, c-A, d-A, e-B, f-B, g-B, h-B, i-B, j-C, k-C, and l-C. The server clusters the case sample set according to the feature data in each case sample, or according to the feature vector determined from that feature data, to obtain an initial clustering result. Taking the diagnosis result as the dimension, the server traverses each diagnosis result corresponding to the case sample set and corrects the initial clustering result for each traversed diagnosis result based on a voting algorithm to obtain a target clustering result. The server determines the category corresponding to each cluster in the target clustering result based on the diagnosis result of the known category and the cluster to which it belongs, and determines the category corresponding to each diagnosis result from the category of its cluster. In this example, the cluster to which diagnosis result B belongs corresponds to chronic disease, and the other cluster corresponds to non-chronic disease. After determining the category corresponding to each diagnosis result, the server queries the prescription extraction mode pre-configured for chronic disease and extracts prescription data from each case sample whose diagnosis result is a chronic disease based on that mode. Likewise, the server queries the prescription extraction mode pre-configured for non-chronic disease and extracts prescription data from each case sample whose diagnosis result is a non-chronic disease based on that mode.
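The correction and category assignment in this example can be sketched with a majority vote per diagnosis result. Majority voting is an assumption about how the voting algorithm operates, the initial cluster labels in the usage note are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# A case sample is modeled as (user identifier, diagnosis result).
Sample = Tuple[str, str]


def correct_by_vote(samples: List[Sample], labels: List[int]) -> Dict[str, int]:
    """Assign each diagnosis result to the cluster that holds most of its samples."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for (_, diagnosis), label in zip(samples, labels):
        votes[diagnosis][label] += 1
    return {diag: counter.most_common(1)[0][0] for diag, counter in votes.items()}


def assign_categories(
    diagnosis_cluster: Dict[str, int],
    known_diagnosis: str,
    known_category: str,
    other_category: str,
) -> Dict[str, str]:
    """Give the known category to the known diagnosis's cluster and the other category to the rest."""
    known_cluster = diagnosis_cluster[known_diagnosis]
    return {
        diag: known_category if cluster == known_cluster else other_category
        for diag, cluster in diagnosis_cluster.items()
    }
```

For instance, with samples a-A through l-C and hypothetical initial labels in which one A sample, one B sample, and one C sample fall in the minority cluster for their diagnosis, `correct_by_vote` maps A and C to one cluster and B to the other, and `assign_categories(..., "B", "chronic", "non-chronic")` then marks B as chronic and A and C as non-chronic.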

In one embodiment, the server may construct a prescription database from prescription data extracted from each case sample in the set of case samples.

It should be understood that although the steps in the flowcharts of FIGS. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. These sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 5, there is provided a document data extraction apparatus 500 comprising: an obtaining module 502, a clustering module 504, a correcting module 506, a determining module 508, and an extracting module 510, wherein:

an obtaining module 502, configured to obtain a user detection data set;

a clustering module 504, configured to cluster the user detection data sets according to a feature vector corresponding to each user detection data in the user detection data sets to obtain an initial clustering result;

a correcting module 506, configured to correct the initial clustering result according to a detection result corresponding to the user detection data set to obtain a target clustering result;

a determining module 508, configured to determine, based on the target clustering result, the cluster to which a detection result of a known category belongs;

the determining module 508 is further configured to determine a category corresponding to each detection result according to the known category and the cluster;

and the extracting module 510 is configured to extract document data from the user detection data according to a category corresponding to a detection result in the user detection data.

In one embodiment, the clustering module 504 is further configured to: extract feature data from each user detection data in the user detection data set; determine a feature vector corresponding to the user detection data according to the feature data; and cluster the user detection data set according to the feature vectors to obtain an initial clustering result.
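One way to picture this step is a one-hot feature vector followed by a plain k-means pass. This is a minimal sketch, not the application's method: the passage above does not name a clustering algorithm or a vector encoding, so k-means, the one-hot encoding, the feature vocabulary, and the names `to_vector` and `kmeans` are all assumptions made for illustration.

```python
import math

def to_vector(features, vocabulary):
    """One-hot encode a set of feature tokens against a fixed vocabulary."""
    return [1.0 if token in features else 0.0 for token in vocabulary]

def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical feature data extracted from four user detection records.
vocabulary = ["cough", "fever", "high_glucose", "thirst"]
feature_data = [
    {"cough", "fever"},
    {"high_glucose", "thirst"},
    {"cough"},
    {"thirst", "high_glucose"},
]
vectors = [to_vector(f, vocabulary) for f in feature_data]
labels = kmeans(vectors, k=2)
print(labels)
```

The labels returned here play the role of the initial clustering result; any other clustering routine that accepts feature vectors could be substituted.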

In an embodiment, the correcting module 506 is further configured to: determine the detection results corresponding to the user detection data set according to the detection result in each user detection data in the user detection data set; traverse the detection results corresponding to the user detection data set; and correct, according to the initial clustering result, the clustering result of the user detection data corresponding to the traversed detection result, to obtain a target clustering result.

In an embodiment, the extracting module 510 is further configured to: when the category corresponding to the detection result in the user detection data is a first category, determine a first detection time according to the user detection data; determine a first time period and a second detection time according to the first detection time and a first preset time length; determine a second time period according to the second detection time and the first preset time length; extract first document data in the first time period and second document data in the second time period from the user detection data; and, when the first document data and the second document data are consistent, determine the first document data or the second document data as the document data extracted from the user detection data.
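The two-period consistency check for the first category can be sketched as below. Several points are assumptions rather than statements from the application: the two time periods are taken to be consecutive windows of equal length, the second detection time is taken to be the end of the first window, "consistent" is read as "the same set of items", and an inconsistent result returns `None`. The record layout and the name `extract_first_category` are hypothetical.

```python
from datetime import date, timedelta

def extract_first_category(records, first_detection, preset_days):
    """Return the items shared by two consecutive windows, or None if they differ."""
    length = timedelta(days=preset_days)
    # Assumption: the second detection time is where the first period ends.
    second_detection = first_detection + length

    def items_in(start):
        return frozenset(item for day, item in records
                         if start <= day < start + length)

    first = items_in(first_detection)
    second = items_in(second_detection)
    # Keep the data only when both periods are non-empty and agree.
    return set(first) if first and first == second else None

# Hypothetical (date, item) pairs taken from one user's detection data.
records = [
    (date(2019, 1, 5), "metformin"),
    (date(2019, 1, 5), "aspirin"),
    (date(2019, 2, 10), "aspirin"),
    (date(2019, 2, 10), "metformin"),
]
extracted = extract_first_category(records, date(2019, 1, 5), preset_days=30)
print(extracted)
```

Requiring agreement across two periods is a simple way to keep only items that recur, which suits a category in which the same document data is expected to repeat over time.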

In an embodiment, the extracting module 510 is further configured to: when the category corresponding to the detection result in the user detection data is a second category, determine a detection time according to the user detection data; determine a preset time period according to the detection time and a second preset time length; and extract the document data within the preset time period from the user detection data as the document data extracted from the user detection data.
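For the second category, a single window is enough. The sketch below assumes the preset time period starts at the detection time and runs for the second preset time length; the record layout and the name `extract_second_category` are hypothetical.

```python
from datetime import date, timedelta

def extract_second_category(records, detection_time, preset_days):
    """Return every item dated within the window that starts at the detection time."""
    end = detection_time + timedelta(days=preset_days)
    return [item for day, item in records if detection_time <= day < end]

# Hypothetical (date, item) pairs taken from one user's detection data.
records = [
    (date(2019, 3, 1), "amoxicillin"),
    (date(2019, 3, 3), "ibuprofen"),
    (date(2019, 4, 20), "loratadine"),
]
extracted = extract_second_category(records, date(2019, 3, 1), preset_days=7)
print(extracted)
```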

In an embodiment, the document data extraction apparatus 500 further includes: a recommendation module;

the recommendation module is configured to: construct a document database according to the extracted document data; when a target detection result that is sent by a terminal and corresponds to a user identifier is received, query the document database for candidate document data matching the target detection result; acquire user basic information corresponding to the user identifier, and screen the candidate document data according to the user basic information; query user feature data corresponding to the user identifier, and select target document data from the screened document data according to the user feature data; and feed back the target document data to the terminal.
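The query, screen, and select stages of the recommendation module can be sketched as three successive filters. Everything concrete here is assumed for illustration: the `Document` fields, the use of age as the user basic information, the tag overlap used to rank by user feature data, and the name `recommend` do not come from the application.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    detection_result: str
    min_age: int
    max_age: int
    tags: frozenset = frozenset()

def recommend(database, target_result, basic_info, feature_tags):
    """Query by detection result, screen by basic info, select by feature overlap."""
    # Stage 1: candidate documents matching the target detection result.
    candidates = [d for d in database if d.detection_result == target_result]
    # Stage 2: screen by user basic information (here, only age).
    screened = [d for d in candidates
                if d.min_age <= basic_info["age"] <= d.max_age]
    if not screened:
        return None
    # Stage 3: pick the document sharing the most tags with the user's features.
    return max(screened, key=lambda d: len(d.tags & feature_tags))

# Hypothetical document database built from previously extracted document data.
database = [
    Document("p1", "hypertension", 18, 60, frozenset({"smoker"})),
    Document("p2", "hypertension", 18, 60, frozenset({"diabetic", "smoker"})),
    Document("p3", "hypertension", 61, 120, frozenset({"diabetic"})),
    Document("p4", "asthma", 0, 120),
]
chosen = recommend(database, "hypertension", {"age": 45}, {"diabetic", "smoker"})
print(chosen.doc_id)
```

In a deployed system the three stages would more likely be database queries keyed by the user identifier; the in-memory lists only show the order of narrowing.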

In an embodiment, the document data extraction apparatus 500 further includes: a scoring module;

the scoring module is configured to: acquire target user detection data to be scored; extract a detection result from the target user detection data, and determine the category corresponding to the extracted detection result; call the document extraction script file corresponding to the determined category to extract document data to be scored from the target user detection data; and input the document data to be scored into a trained document scoring model for prediction to obtain a document score.
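The scoring flow amounts to dispatching on the category and passing the extracted data to a model. In this sketch the per-category extraction script files are represented by plain Python callables and the trained document scoring model by a stand-in function; both, along with the field names and the name `score_document`, are assumptions made for illustration only.

```python
def score_document(detection_data, category_of, extractors, model):
    """Look up the category, run that category's extractor, then score the result."""
    result = detection_data["detection_result"]
    category = category_of[result]
    document = extractors[category](detection_data)
    return model(document)

def extract_chronic(data):
    # Stand-in for the extraction script configured for the chronic category.
    return list(data["prescriptions"])

def extract_non_chronic(data):
    # Stand-in for the extraction script configured for the non-chronic category.
    return list(data["prescriptions"][:1])

def stub_model(document):
    # Stand-in for a trained scoring model; not a real predictor.
    return min(1.0, len(document) / 5)

# Hypothetical mapping from detection result to category.
category_of = {"diabetes": "chronic", "influenza": "non-chronic"}
extractors = {"chronic": extract_chronic, "non-chronic": extract_non_chronic}

target = {"detection_result": "diabetes", "prescriptions": ["metformin", "insulin"]}
score = score_document(target, category_of, extractors, stub_model)
print(score)
```

Keeping the extractors in a dictionary keyed by category mirrors the idea of one extraction script per category and lets a new category be added without changing `score_document`.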

For specific limitations of the document data extraction apparatus, reference may be made to the limitations of the document data extraction method above, which are not repeated here. Each module in the document data extraction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the user detection data set, the detection results of known categories, and the known categories themselves. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a document data extraction method.

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the document data extraction method of the various embodiments described above when the computer program is executed by the processor.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the document data extraction method in the various embodiments described above.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered to be within the scope of the present specification as long as the combined features do not contradict one another.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent shall be subject to the appended claims.

Claims

1. A document data extraction method, the method comprising:

acquiring a user detection data set;

2. The method of claim 1, wherein the clustering the user detection data set according to the feature vector corresponding to each user detection data in the user detection data set to obtain an initial clustering result comprises:

3. The method according to claim 1, wherein the correcting the initial clustering result according to the detection result corresponding to the user detection data set to obtain a target clustering result comprises:

traversing the detection result corresponding to the user detection data set;

4. The method according to claim 1, wherein the extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data comprises:

5. The method according to claim 1, wherein the extracting document data from the user detection data according to the category corresponding to the detection result in the user detection data comprises:

6. The method according to any one of claims 1 to 5, wherein after extracting document data from the user detection data according to a category corresponding to a detection result in the user detection data, the method further comprises:

constructing a document database according to the extracted document data;

and feeding back the target bill data to the terminal.

7. The method according to any one of claims 1 to 5, further comprising:

acquiring target user detection data to be scored;

8. A document data extraction apparatus, the apparatus comprising:

the acquisition module is used for acquiring a user detection data set;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.