CN111104481B

CN111104481B - Method, device and equipment for identifying matching field

Info

Publication number: CN111104481B
Application number: CN201911304454.7A
Authority: CN
Inventors: 冯仓龙
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-17
Filing date: 2019-12-17
Publication date: 2023-10-10
Anticipated expiration: 2039-12-17
Also published as: CN111104481A

Abstract

The embodiments of the application disclose a method, a device and equipment for identifying a matching field. Because the field to be identified and the target field are not represented in a uniform way, they cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, which makes it possible to identify the matching field. In addition, different identification modes are adopted for different target fields, which improves the efficiency of identifying the data items.

Description

Method, device and equipment for identifying matching field

Technical Field

The present application relates to the field of information technology processing, and in particular, to a method, an apparatus, and a device for identifying a matching field.

Background

With the rapid popularization and development of internet technology, a large amount of data is generated in various application fields. Because different users configure their systems differently, data generated in the same field takes different forms of expression; for example, the same thing may be described in several ways, so the data in a database is personalized. In practical applications, when data of the same type is searched for in the database, the required data may not be found because of this personalization.

Disclosure of Invention

In view of this, the embodiments of the present application provide a method, apparatus and device for identifying matching fields, so as to implement accurate query of matching fields.

In order to solve the above problems, the technical solution provided by the embodiment of the present application is as follows:

a method of identifying matching fields, the method comprising:

determining the identification mode of the target field;

identifying the data item of the field to be identified by utilizing the identification mode, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

determining whether the field to be identified is matched with the target field according to an identification result of whether the data item of the field to be identified is matched with the target field;

and determining the field to be identified matched with the target field as a matched field of the target field.

In one possible implementation manner, the identifying the data item of the field to be identified by using the identifying manner, and obtaining an identifying result of whether the data item of the field to be identified matches the target field includes:

when the recognition mode is recognition by adopting a deep learning model, acquiring a target deep learning model corresponding to the target field; the target deep learning model corresponding to the target field is obtained by training according to positive sample data and negative sample data, the positive sample data is characteristic representation of a data item matched with the target field, and the negative sample data is characteristic representation of a data item not matched with the target field;

generating a characteristic representation of the data item of the field to be identified;

and inputting the characteristic representation of the data item of the field to be identified into a target deep learning model corresponding to the target field, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field.

In one possible implementation manner, the generating the characteristic representation of the data item of the field to be identified includes:

extracting text features of data items of a field to be identified, wherein the text features comprise one or more of character features, inter-character features, word features and inter-word features;

calculating the matching degree characteristics of the data items of the field to be identified and each training text set;

and forming the characteristic representation of the data item of the field to be identified by the text characteristic of the data item of the field to be identified and the matching degree characteristic of the data item of the field to be identified and each training text set.

In one possible implementation, the text feature of the data item of the field to be identified includes any one or a combination of the following:

according to the character characteristics obtained through the text training of the medical data, converting each character of the data item of the field to be identified into a first character characteristic value, and determining the first character characteristic value of each character of the data item of the field to be identified as the character characteristics of the data item of the field to be identified;

extracting a single character or multiple characters that are adjacent to a first target character and within a preset range of the first target character to form a first character group, converting the first character group into a first character group characteristic value according to the character characteristics obtained through medical data text training, determining the first character group characteristic value as the position characteristic of the first target character, and determining the position characteristic of each first target character as the inter-character position characteristic of the data item of the field to be identified, wherein each character in the data item of the field to be identified is taken in turn as the first target character;

performing word segmentation on the data item of the field to be identified, converting each word segment of the data item of the field to be identified into a first word characteristic value according to word characteristics obtained through medical data text training, and determining each first word characteristic value as the word characteristic of the data item of the field to be identified;

segmenting the data item of the field to be identified into words, extracting a second target word segment that is adjacent to a first target word segment and within a preset range of the first target word segment, converting the second target word segment into a second word characteristic value according to the word characteristics obtained through medical data text training, determining the second word characteristic value as the position characteristic of the first target word segment, and determining the position characteristic of each first target word segment as the inter-word position characteristic of the data item of the field to be identified, wherein each word segment in the data item of the field to be identified is taken in turn as the first target word segment.

In a possible implementation manner, the calculating the matching degree characteristic of the data item of the field to be identified and each training text set includes:

acquiring a matching value of a data item of the field to be identified and a j-th data item in an i-th training text set, wherein i and j are positive integers, and each training text set comprises data items of the same category;

calculating the matching degree value of the data item of the field to be identified and the ith training text set according to the matching value of the data item of the field to be identified and the jth data item in the ith training text set;

and determining the matching degree value of the data item of the field to be identified and each training text set as the matching degree characteristic of the data item of the field to be identified and each training text set.
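The calculation above might be sketched as follows. The source does not say how the matching value between two data items is computed or how the per-item values are aggregated into a value for the whole set; this sketch assumes the similarity ratio from Python's `difflib.SequenceMatcher` as the matching value and the maximum over the set as the aggregate. The function names and the sample training text sets are hypothetical.

```python
from difflib import SequenceMatcher


def matching_value(item: str, other: str) -> float:
    """Assumed matching value: a similarity ratio in [0, 1]."""
    return SequenceMatcher(None, item, other).ratio()


def matching_degree_features(item: str, training_sets: list[list[str]]) -> list[float]:
    """Return one matching degree value per training text set.

    training_sets[i][j] is the j-th data item of the i-th training text set;
    each set holds data items of the same category.
    """
    features = []
    for text_set in training_sets:
        values = [matching_value(item, other) for other in text_set]
        # Assumed aggregation: best match within the set (0.0 for an empty set).
        features.append(max(values, default=0.0))
    return features


# Hypothetical training text sets: one of disease names, one of drug names.
disease_set = ["chronic bronchitis", "asthma", "heart disease"]
drug_set = ["aspirin", "ibuprofen"]
features = matching_degree_features("asthma", [disease_set, drug_set])
```

The result has one entry per training text set, so its length is fixed by the number of categories rather than by the length of the data item.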

In one possible implementation manner, the training process of the target deep learning model corresponding to the target field includes:

acquiring a data item matched with the target field, generating a characteristic representation of the data item matched with the target field, and determining the characteristic representation of the data item matched with the target field as positive sample data;

acquiring a data item which is not matched with the target field, generating a characteristic representation of the data item which is not matched with the target field, and determining the characteristic representation of the data item which is not matched with the target field as negative sample data;

and training according to the positive sample data and the negative sample data to obtain a target deep learning model corresponding to the target field.

when the recognition mode is character matching recognition, acquiring keywords corresponding to the target field;

matching the data item of the field to be identified with the keyword corresponding to the target field;

if the data item of the field to be identified is matched with the keyword corresponding to the target field, acquiring an identification result of the data item of the field to be identified matched with the target field;

and if the data item of the field to be identified is not matched with the keyword corresponding to the target field, acquiring an identification result of the data item of the field to be identified, which is not matched with the target field.
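A minimal sketch of the character matching mode, assuming that "matching" means equality with one of the keywords after trimming whitespace and lowercasing; the source does not define the comparison, and the keyword values below are hypothetical.

```python
def keyword_match(data_item: str, keywords: set[str]) -> bool:
    """Return True if the data item matches a keyword of the target field.

    Assumption: a match is equality after trimming and lowercasing.
    """
    return data_item.strip().lower() in keywords


# Hypothetical keywords for a medical insurance category target field.
insurance_keywords = {"urban employee", "urban resident", "self-paid"}
```

For example, `keyword_match(" Self-paid ", insurance_keywords)` yields a "matched" result, while a disease name such as `"asthma"` yields "not matched".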

When the identification mode is regular rule matching identification, a regular rule corresponding to the target field is obtained;

judging whether the data item of the field to be identified meets the regular rule corresponding to the target field;

if the data item of the field to be identified meets the regular rule corresponding to the target field, acquiring an identification result of matching the data item of the field to be identified with the target field;

and if the data item of the field to be identified does not meet the regular rule corresponding to the target field, acquiring an identification result of mismatching of the data item of the field to be identified and the target field.
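The regular rule mode might look like the sketch below. The pattern is a hypothetical rule for a date-of-birth target field in `YYYY-MM-DD` form; the actual rules used for each target field are not given in the source.

```python
import re

# Hypothetical regular rule for a date-of-birth target field: YYYY-MM-DD.
DATE_OF_BIRTH_RULE = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def regex_match(data_item: str, rule: re.Pattern) -> bool:
    """Return True if the whole data item satisfies the regular rule."""
    return rule.fullmatch(data_item.strip()) is not None
```

Using `fullmatch` rather than `search` is a design choice in this sketch: it requires the entire data item to satisfy the rule, so a free-text value that merely contains a date is not counted as a match.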

In one possible implementation manner, the determining whether the field to be identified matches the target field according to the identification result of whether the data item of the field to be identified matches the target field includes:

among the recognition results indicating whether each of a plurality of data items randomly selected from the field to be identified matches the target field, if the results that match the target field outnumber the results that do not, determining that the field to be identified matches the target field; and if the results that match are fewer than or equal in number to the results that do not, determining that the field to be identified does not match the target field.

In one possible implementation, the method further includes:

determining a target data table in which a target field is located;

searching a data table matched with the target data table in the data table to be identified;

and determining the fields in the data table matched with the target data table as fields to be identified.

An apparatus to identify matching fields, the apparatus comprising:

a first determining unit, configured to determine an identification manner of the target field;

the acquisition unit is used for identifying the data item of the field to be identified by utilizing the identification mode and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

the second determining unit is used for determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field;

and a third determining unit, configured to determine a field to be identified that matches the target field as a matching field of the target field.

A computer readable storage medium having instructions stored therein which, when executed on a terminal device, cause the terminal device to perform the method of identifying matching fields.

An apparatus for identifying matching fields, comprising a processor, wherein the processor is configured to implement the method of identifying matching fields when executing a computer program.

From this, the embodiment of the application has the following beneficial effects:

in this embodiment, the identification manner of the target field is first determined, and then the data item of the field to be identified is identified by using the identification manner, so as to obtain an identification result of whether the data item of the field to be identified is matched with the target field. And then, according to the recognition result of whether the data item of the field to be recognized is matched with the target field, determining whether the field to be recognized is matched with the target field, so as to determine the field to be recognized matched with the target field as a matching field of the target field.

That is, when identifying a matching field, the embodiment of the application first determines whether the data items of the field to be identified match the target field, and then determines from those identification results whether the field to be identified itself matches the target field. Because the field to be identified and the target field are not represented in a uniform way, they cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, which makes it possible to identify the matching field. In addition, different identification modes are adopted for different target fields, which improves the efficiency of identifying the data items.

Drawings

FIG. 1 is a flowchart of a method for identifying matching fields according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for obtaining a recognition result according to an embodiment of the present application;

FIG. 3 is a flowchart of another method for obtaining a recognition result according to an embodiment of the present application;

FIG. 4 is a flowchart of another method for obtaining a recognition result according to an embodiment of the present application;

FIG. 5 is a block diagram of a device for identifying a matching field according to an embodiment of the present application.

Detailed Description

In order to make the above objects, features and advantages of the present application more readily apparent, embodiments of the application are described in further detail below with reference to the accompanying drawings.

In order to facilitate understanding of the technical solution provided by the embodiments of the present application, a description is first given of the background technology related to the embodiments of the present application.

Because different users in the same field configure their systems differently, the data they produce takes different forms of expression; for example, the same thing may be described in several ways, so the data in a database is personalized. In practical applications, when data of the same type is searched for in the database, the required data may not be found because of this personalization.

In particular, in the medical field, different health institutions use a hospital information system (HIS) to manage their own medical data. However, HIS configuration formats differ from vendor to vendor, so the resulting medical data formats vary widely. When a health information platform extracts the medical data it needs from the HIS data tables of the institutions, the fields of the data tables in the health information platform are standard fields, whereas the fields of the HIS data tables are non-standard fields that differ greatly from one another, so the required fields cannot be found quickly in the HIS data tables.

Based on the above, the embodiment of the application provides a method for identifying a matching field, which comprises the steps of firstly determining a target field and an identification mode corresponding to the target field, and identifying a data item of a field to be identified by using the identification mode to obtain an identification result of whether the data item of the field to be identified is matched with the target field. Namely, the data item which can represent the field to be identified is matched with the target field to obtain the identification result. And then, according to the recognition result of whether the data item of the field to be recognized is matched with the target field, determining whether the field to be recognized is matched with the target field, so as to determine the field to be recognized matched with the target field as a matching field of the target field. It can be seen that, although the target field is not unified with the field to be identified, the data item corresponding to the field to be identified may be used to match the target field, so as to determine the matching field corresponding to the target field.

A field may be understood as information that characterizes a category of service data in a data table, for example a disease field, a medical insurance category field, or a medical order field. The data items of a field are the specific values entered in the HIS; for example, the data items corresponding to the disease field include chronic bronchitis, Alzheimer's disease, asthma, and the like.

In order to facilitate understanding of the technical solution provided by the embodiments of the present application, a method for identifying matching fields provided by the embodiments of the present application will be described below with reference to the accompanying drawings.

Referring to FIG. 1, which is a flowchart of a method for identifying matching fields according to an embodiment of the present application, the method may include:

S101: Determine the identification mode of the target field.

In this embodiment, the target field may be understood as the field for which a matching field is sought; in practical applications, the target field may be any standard field in a data table of the health information platform. Once the target field is determined, the identification mode corresponding to the target field is determined, so that this mode can later be used to identify the data items of the field to be identified.

It can be understood that, because different fields have different characteristics, a different identification mode can be preset for each field so that the characteristics of the field are fully used when the data items are subsequently identified. Specifically, a deep learning model identification mode can be adopted for highly specialized fields, such as a disease field, a medical institution field, or a medicine field; a character matching identification mode can be adopted for fields whose data items have fixed content, such as a medical insurance category field or a registration category field; and a regular rule (regular expression) identification mode can be adopted for fields whose data items follow a specific rule, for example data items containing numeric information, such as a contact information field, a date of birth field, or a time of surgery field. The specific implementation of each identification mode is described in the following embodiments.
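One way to represent the preset assignment of identification modes is a lookup table keyed by target field, sketched below. The field examples come from the paragraph above; the enum, the key strings, and the function name are hypothetical.

```python
from enum import Enum


class RecognitionMode(Enum):
    DEEP_LEARNING = "deep_learning"
    KEYWORD = "keyword"
    REGEX = "regex"


# Preset mapping from target field to identification mode, using the
# example fields named in the text; the key strings are hypothetical.
MODE_BY_TARGET_FIELD = {
    "disease": RecognitionMode.DEEP_LEARNING,
    "medical_institution": RecognitionMode.DEEP_LEARNING,
    "medicine": RecognitionMode.DEEP_LEARNING,
    "medical_insurance_category": RecognitionMode.KEYWORD,
    "registration_category": RecognitionMode.KEYWORD,
    "contact_information": RecognitionMode.REGEX,
    "date_of_birth": RecognitionMode.REGEX,
    "time_of_surgery": RecognitionMode.REGEX,
}


def determine_mode(target_field: str) -> RecognitionMode:
    """S101: look up the identification mode preset for a target field."""
    return MODE_BY_TARGET_FIELD[target_field]
```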

S102: Identify the data item of the field to be identified by using the identification mode, and acquire an identification result of whether the data item of the field to be identified matches the target field.

S103: Determine whether the field to be identified matches the target field according to the identification result of whether the data item of the field to be identified matches the target field.

After the identification mode corresponding to the target field is determined, the data item of the field to be identified is identified by utilizing the identification mode, so that an identification result of whether the data item of the field to be identified is matched with the target field or not is obtained. And then, determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field.

It can be appreciated that, since the data item of the field to be identified can characterize the attribute of the field to be identified, whether the field to be identified matches the target field can be determined according to the identification result of whether the data item of the field to be identified matches the target field.

It should be noted that the fields to be identified may be the fields of a data table to be identified; in practical applications, the data table to be identified may be a data table in the HIS. A field to be identified may have multiple data items, for example asthma, bronchitis, heart disease, and so on. When the data items of the field to be identified are identified by using the identification mode corresponding to the target field, identification can be performed on every data item of the field, or on several data items selected from them, to obtain an identification result of whether each of those data items matches the target field. Whether the field to be identified matches the target field is then determined according to these identification results.

In a specific implementation, whether the field to be identified matches the target field can be determined from the identification results by voting. Specifically, among the identification results for a plurality of data items randomly selected from the field to be identified, if the results that match the target field outnumber the results that do not, the field to be identified is determined to match the target field; if the results that match are fewer than or equal in number to the results that do not, the field to be identified is determined not to match the target field. That is, several data items may be selected at random from the data items of the field to be identified, and whether the field matches the target field is decided from their identification results by majority rule, with the minority deferring to the majority.
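The voting step can be sketched as follows. The sample size, the `seed` parameter, and the `recognize` callable (any per-item recognizer that returns True for a match) are assumptions of this sketch; the source only specifies random selection and the strict-majority rule.

```python
import random
from typing import Callable, Optional


def field_matches(
    data_items: list[str],
    recognize: Callable[[str], bool],
    sample_size: int = 5,
    seed: Optional[int] = None,
) -> bool:
    """Vote on whether a field to be identified matches the target field."""
    rng = random.Random(seed)
    # Randomly select up to sample_size data items from the field.
    sample = rng.sample(data_items, min(sample_size, len(data_items)))
    matched = sum(1 for item in sample if recognize(item))
    unmatched = len(sample) - matched
    # Strict majority: a tie counts as "not matched", as in the text.
    return matched > unmatched
```

Note that a tie, including the case of a field with no data items, is treated as a non-match, which follows the "less than or equal to" condition stated above.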

The specific implementation of identifying the data item of the field to be identified by using the identification manner and obtaining the identification result of whether the data item of the field to be identified matches the target field will be described in the following embodiments.

S104: Determine the field to be identified that matches the target field as a matching field of the target field.

When a field to be identified is determined to match the target field according to the identification results of its data items, that field is determined as a matching field of the target field. Specifically, when every data item of the field to be identified matches the target field, or when a preset number of data items of the field to be identified match the target field, the field to be identified is determined to match the target field and is therefore a matching field of the target field.

Based on the above description, when a matching field is identified, it is first determined whether the data items of the field to be identified match the target field, and it is then determined from those identification results whether the field to be identified itself matches the target field. Because the field to be identified and the target field are not represented in a uniform way, they cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, which makes it possible to identify the matching field. In addition, different identification modes are adopted for different target fields, which improves the efficiency of identifying the data items.
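Steps S101 to S104 can be tied together as in the sketch below. It is an illustration under stated assumptions, not the patented implementation: the recognizer for the disease field is a simple set lookup standing in for the deep learning model, the vote is taken over all data items of each field rather than a random sample, and the table contents and all names are hypothetical.

```python
from typing import Callable

Recognizer = Callable[[str], bool]


def find_matching_fields(
    target_field: str,
    recognizers: dict[str, Recognizer],
    table: dict[str, list[str]],
) -> list[str]:
    """Return the names of fields in `table` that match `target_field`.

    recognizers maps each target field to the recognizer preset for it;
    table maps each field to be identified to its data items.
    """
    recognize = recognizers[target_field]                # S101
    matching = []
    for field_name, items in table.items():
        results = [recognize(item) for item in items]    # S102
        matched = sum(results)
        if matched > len(results) - matched:             # S103: majority vote
            matching.append(field_name)                  # S104
    return matching


# Hypothetical HIS data table with non-standard field names.
his_table = {
    "col_a": ["asthma", "bronchitis", "heart disease"],
    "col_b": ["1980-05-17", "1975-11-02", "unknown"],
}
# Stand-in recognizer: a set lookup in place of a trained model.
known_diseases = {"asthma", "bronchitis", "heart disease"}
recognizers = {"disease": lambda item: item in known_diseases}
result = find_matching_fields("disease", recognizers, his_table)
```

Here the field names `col_a` and `col_b` carry no usable meaning, which is the situation the method addresses: the match is decided from the data items alone.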

As can be seen from the foregoing embodiments, the embodiments of the present application provide three ways of obtaining the recognition result, and for better understanding the implementation process of each recognition method, the following description will be given with reference to the accompanying drawings.

Referring to FIG. 2, which is a flowchart of a method for obtaining a recognition result according to an embodiment of the present application, the method may include:

S201: When the recognition mode is recognition by a deep learning model, acquire the target deep learning model corresponding to the target field.

For some highly specialized fields, the data items may take many different forms, and the recognition mode of such fields can be recognition by a deep learning model. When the recognition mode of the target field is determined to be deep learning model recognition, the target deep learning model corresponding to the target field is acquired and used to recognize the data items of the field to be identified. Among the fields whose recognition mode is deep learning model recognition, each field may correspond to its own deep learning model, so the target deep learning model corresponding to the target field needs to be acquired first. The target deep learning model is trained on positive sample data and negative sample data, where the positive sample data is the feature representation of data items that match the target field and the negative sample data is the feature representation of data items that do not match the target field. That is, the target deep learning model can recognize whether an input data item matches the target field.

S202: a characteristic representation of the data item of the field to be identified is generated.

It can be understood that the target deep learning model is trained on feature representations of data items. Therefore, when the target deep learning model is used to recognize a data item of the field to be identified, the feature representation of that data item needs to be obtained first and then input into the target deep learning model to obtain the recognition result.

In a specific implementation, the characteristic representation of the data item of the field to be identified may be generated by:

1) Text features of data items of the field to be identified are extracted.

That is, for each data item of the field to be identified, the text features of that data item are extracted. The text features include one or more of character features, inter-character features, word features, and inter-word features.

Specifically, when extracting text features of a data item of a field to be identified, any one or a combination of the following may be extracted:

11) According to the character features obtained through training on medical data text, each character of the data item of the field to be identified is converted into a first character feature value, and the first character feature values of all characters of the data item are determined as the character features of the data item of the field to be identified.

For each character in the data item of the field to be identified, the character is converted into a first character feature value according to the character features obtained through training on medical data text, and the first character feature values of all characters are determined as the character features of the data item. A deep learning method can be used to train on medical data text (such as common medical nouns, institution names, medicine names, medical insurance names, disease names and the like) to obtain a character feature for each character used in the medical field. In addition, training on the medical data text can yield a word feature for each word used in the medical field. Both character features and word features may be expressed as feature vectors.
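As an illustration of the per-character conversion, the sketch below looks up each character in a table of feature vectors. The table is a hypothetical stand-in for vectors learned from medical data text, the two-dimensional values are invented for the example, and mapping unseen characters to a zero vector is an assumption of this sketch.

```python
# Hypothetical character feature table; in the described method these
# vectors would be obtained by training on medical data text.
CHAR_FEATURES = {
    "a": [0.1, 0.3],
    "s": [0.7, 0.2],
    "t": [0.4, 0.4],
}
UNKNOWN_CHAR = [0.0, 0.0]  # assumed fallback for characters not in the table


def character_features(data_item: str) -> list[list[float]]:
    """Convert each character of a data item into its feature vector."""
    return [CHAR_FEATURES.get(ch, UNKNOWN_CHAR) for ch in data_item]
```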

12) A single character or multiple characters that are adjacent to a first target character and within a preset range of the first target character are extracted to form a first character group, the first character group is converted into a first character group feature value according to the character features obtained through training on medical data text, the first character group feature value is determined as the position feature of the first target character, and the position features of all first target characters are determined as the inter-character position feature of the data item of the field to be identified.

Each character in the data item of the field to be identified is taken in turn as the first target character, and a single character or multiple characters that are adjacent to the first target character and within a preset range of it are extracted to form a first character group. The first character group is converted into a first character group feature value according to the character features obtained through training on medical data text, and the first character group feature value is determined as the position feature of the first target character. After the position feature of each first target character of the data item has been obtained, the position features of all first target characters are determined as the inter-character position feature of the data item of the field to be identified. The preset range is a window that sets how many characters are extracted, and it can be set according to the actual application. For example, when the window is 1, one character is extracted forward and one character is extracted backward from the position of the first target character, giving two first character groups; when the window is 2, two characters are extracted forward and two characters are extracted backward, again giving two first character groups.

For example, the data item corresponding to the field to be identified is "senile dementia" and the extraction window is 2. When "old", the first character of the data item, is the first target character, forward extraction yields nothing, so a preset symbol representing empty, such as "-", is used, and backward extraction yields the two characters that follow it; "-" and the two following characters are each taken as a first character group, each first character group is converted into a first character group feature value, and the two first character group feature values are determined as the position feature of "old". When "dementia" is the first target character, forward extraction yields "senile" and backward extraction yields the two characters that follow it; the two groups are each taken as a first character group, each first character group is converted into a first character group feature value, and the two first character group feature values are determined as the position feature of "dementia". After the position feature of each character in the data item "senile dementia" is obtained, the position features of all characters are determined as the inter-character position features of "senile dementia".
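The window extraction described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the placeholder string "ABCDE" and the handling of a partially filled window (the group is simply shorter) are assumptions, and the conversion of each character group into a feature value is not shown.

```python
def context_groups(chars, index, window, pad="-"):
    """Return the (forward, backward) character groups around chars[index].

    An empty side is replaced by the preset empty symbol `pad`.
    Assumption: a partially filled window yields a shorter group.
    """
    before = chars[max(0, index - window):index]
    after = chars[index + 1:index + 1 + window]
    forward = "".join(before) if before else pad
    backward = "".join(after) if after else pad
    return forward, backward


def inter_char_position_groups(item, window=2):
    """Collect the two first character groups for every character of the item."""
    chars = list(item)
    return [context_groups(chars, i, window) for i in range(len(chars))]


# Hypothetical five-character data item standing in for a real one.
groups = inter_char_position_groups("ABCDE", window=2)
```

With this placeholder, the first character has "-" as its forward group and "BC" as its backward group, and the third character has "AB" and "DE", mirroring the pattern of the example above.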

13) Segmenting the data item of the field to be identified into words, converting each word segment of the data item of the field to be identified into a first word feature value according to the word features obtained through medical data text training, and determining each first word feature value as the word features of the data item of the field to be identified.

In this embodiment, word segmentation is performed on the data item of the field to be identified to obtain its word segments, each word segment is converted into a first word feature value according to the word features obtained through medical data text training, and each first word feature value is determined as a word feature of the data item of the field to be identified. The word segmentation itself can be implemented with a conventional word segmentation method, which is not described in detail here.

14) Segmenting the data item of the field to be identified into words, extracting second target word segments that are adjacent to a first target word segment and within a preset range of the first target word segment, converting the second target word segments into second word feature values according to the word features obtained through medical data text training, determining the second word feature values as the position feature of the first target word segment, and determining the position feature of each first target word segment as the inter-word position features of the data item of the field to be identified.

In this embodiment, word segmentation is performed on the data item of the field to be identified to obtain its word segments, each word segment is taken in turn as a first target word segment, and the second target word segments adjacent to the first target word segment and within the preset range of the first target word segment are extracted. The second target word segments are then converted into second word feature values according to the word features obtained through medical data text training, the second word feature values are determined as the position feature of the first target word segment, and finally the position feature of each first target word segment is determined as the inter-word position features of the data item of the field to be identified.

The preset range of the first target word segment is the window used for extracting adjacent word segments around the first target word segment, and the size of the window can be set according to the actual application. For example, when the window is 1, 1 second target word segment is extracted forward and 1 second target word segment is extracted backward from the position of the first target word segment, the 2 extracted second target word segments are each converted into a second word feature value, and the second word feature values are determined as the position feature of the first target word segment; when the window is 2, 2 second target word segments are extracted forward and 2 second target word segments are extracted backward from the position of the first target word segment, the 4 extracted second target word segments are each converted into a second word feature value, and the second word feature values are determined as the position feature of the first target word segment.

For example, the data item of the field to be identified is "senile dementia", the word segmentation results are "senile", "dementia" and "symptom", and the extraction window is 1. When "senile" is the first target word segment, forward extraction yields empty, represented by "-", and backward extraction yields "dementia"; "-" and "dementia" are taken as the second target word segments, the two second target word segments are each converted into a second word feature value, and the two second word feature values are determined as the position feature of the first target word segment "senile". When "dementia" is the first target word segment, the second target word segment extracted forward is "senile" and the one extracted backward is "symptom"; "senile" and "symptom" are each converted into a second word feature value, and these values are determined as the position feature of "dementia". Similarly, when the first target word segment is "symptom", the second target word segment extracted forward is "dementia" and the one extracted backward is empty, "-"; "dementia" and "-" are each converted into a second word feature value and used as the position feature of "symptom". After the position feature of each word segment in the data item "senile dementia" is obtained, the position features of all word segments are determined as the inter-word position features of "senile dementia".
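The word-level extraction can be illustrated in the same spirit. The sketch below assumes the segmentation result is already available as a list, pads every missing neighbor slot with the empty symbol "-" (an assumption for windows larger than 1), and leaves out the conversion into second word feature values; `neighbor_segments` is a hypothetical name.

```python
def neighbor_segments(segments, window=1, pad="-"):
    """For each first target word segment, list its second target word segments.

    Returns a list of (target, neighbors) pairs; neighbors holds `window`
    segments before the target followed by `window` segments after it.
    Assumption: each missing slot is filled with the empty symbol `pad`.
    """
    padded = [pad] * window + list(segments) + [pad] * window
    result = []
    for i, target in enumerate(segments):
        center = i + window
        before = padded[center - window:center]
        after = padded[center + 1:center + 1 + window]
        result.append((target, before + after))
    return result


# Segmentation result taken from the example in the text.
pairs = neighbor_segments(["senile", "dementia", "symptom"], window=1)
```

For the three segments of the example, `pairs` associates "senile" with "-" and "dementia", "dementia" with "senile" and "symptom", and "symptom" with "dementia" and "-".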

From the above description, it can be seen that several kinds of text features of the data item of the field to be identified can be obtained, namely character features, inter-character position features, word features and inter-word position features. Each kind of feature may itself comprise a plurality of features. For example, a data item may correspond to a plurality of word segments, so the word features include the word feature corresponding to each word segment; similarly, the character features may include the character feature corresponding to each character, the inter-character position features may include the position feature corresponding to each character, and so on.

2) Calculating the matching degree features of the data item of the field to be identified and each training text set.

In this embodiment, the matching degree features of the data item of the field to be identified and each training text set, that is, the correlation features between the data item and each training text set, may also be calculated. A training text set is the set of data items corresponding to one field; each field has its own training text set. For example, the training text set corresponding to a medicine field may be [eszolam tablet, benzathine penicillin, long-acting xilin], and the training text set corresponding to a medical insurance field may be [rural cooperative medical care, town medical insurance, business insurance].

Specifically, the matching degree features of the data item of the field to be identified and the training text sets may be calculated in the following manner:

21) Acquiring the matching value of the data item of the field to be identified and the j-th data item in the i-th training text set, wherein i and j are positive integers, and each training text set comprises data items of one category.

For each data item of the field to be identified, a matching value between the data item and each data item in the training text set is calculated, so that the matching value of the data item and each data item in the training text set is obtained.

22) Calculating the matching degree value of the data item of the field to be identified and the i-th training text set according to the matching value of the data item of the field to be identified and the j-th data item in the i-th training text set.

After the matching value of the data item of the field to be identified and each data item in a given training text set is obtained, the matching degree value of the data item and that training text set is calculated from these matching values. When there are multiple training text sets, the matching degree value of each data item of the field to be identified and each training text set is calculated. For example, if the field to be identified corresponds to 3 data items and there are 20 training text sets, the matching degree value of each data item and each of the 20 training text sets is calculated, giving 60 matching degree values in total.

The calculation can be specifically performed by using the following formula:

where q_i represents the matching degree value of the data item of the field to be identified and the i-th training text set, u_i represents the correlation coefficient corresponding to the i-th training text set, w_ij represents the matching value of the data item of the field to be identified and the j-th data item in the i-th training text set, and N is the number of data items in the i-th training text set.

23) Determining the matching degree value of the data item of the field to be identified and each training text set as the matching degree feature of the data item of the field to be identified and each training text set.

After the matching degree values of the data item of the field to be identified and each training text set are obtained, the matching degree value of the data item and a given training text set is determined as the matching degree feature of the data item and that training text set.
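Steps 21) to 23) can be sketched as below. Two points are assumptions rather than details given in the text: the matching value w_ij is computed here with the `difflib.SequenceMatcher` similarity ratio from the Python standard library, and the matching degree value is taken to be the coefficient-weighted average q_i = u_i * (w_i1 + ... + w_iN) / N, because the formula itself is not reproduced in this text. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def matching_value(item, reference):
    """w_ij: similarity of the data item to one training data item.

    Assumption: a generic string similarity ratio in [0, 1].
    """
    return SequenceMatcher(None, item, reference).ratio()


def matching_degree(item, text_set, coefficient=1.0):
    """q_i for one training text set.

    Assumption: coefficient-weighted average of the matching values.
    """
    if not text_set:
        return 0.0
    total = sum(matching_value(item, ref) for ref in text_set)
    return coefficient * total / len(text_set)


def matching_degree_features(item, text_sets, coefficients=None):
    """One matching degree value per training text set."""
    if coefficients is None:
        coefficients = [1.0] * len(text_sets)
    return [matching_degree(item, s, u) for s, u in zip(text_sets, coefficients)]


# Training text sets taken from the example in the text.
text_sets = [
    ["eszolam tablet", "benzathine penicillin", "long-acting xilin"],
    ["rural cooperative medical care", "town medical insurance", "business insurance"],
]
features = matching_degree_features("town medical insurance", text_sets)
```

Under these assumptions, a data item that appears in a training text set receives a higher matching degree value for that set than for an unrelated one, which is the property the matching degree feature is meant to capture.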

3) Forming the feature representation of the data item of the field to be identified from the text features of the data item of the field to be identified and the matching degree features of the data item of the field to be identified and each training text set.

After each text feature of the data item of the field to be identified and the matching degree feature of the data item of the field to be identified and each training text set are obtained, all the obtained text features and the matching degree features form the feature representation of the data item of the field to be identified.
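One simple way to form such a feature representation is plain concatenation, sketched below. The text does not specify how the features are combined, so concatenation, the dictionary keys and the toy vectors are all assumptions; a real model would additionally need a fixed input length, which this sketch does not address.

```python
def build_feature_representation(text_features, matching_degree_features):
    """Concatenate all text feature vectors and the matching degree values.

    text_features maps a feature kind to a list of vectors (lists of floats).
    Assumption: kinds are concatenated in a fixed order, missing kinds skipped.
    """
    order = ("char", "char_position", "word", "word_position")
    representation = []
    for kind in order:
        for vector in text_features.get(kind, []):
            representation.extend(vector)
    representation.extend(matching_degree_features)
    return representation


# Toy two-dimensional vectors; the values carry no meaning.
toy_text_features = {
    "char": [[0.1, 0.2], [0.3, 0.4]],
    "word": [[0.5, 0.6]],
}
representation = build_feature_representation(toy_text_features, [0.9, 0.1])
```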

S203: inputting the feature representation of the data item of the field to be identified into the target deep learning model corresponding to the target field, and obtaining an identification result of whether the data item of the field to be identified matches the target field.

After the feature representation of the data item of the field to be identified is obtained, the feature representation is input into the target deep learning model, and the target deep learning model identifies the feature representation and outputs an identification result of whether the corresponding data item matches the target field. Specifically, when the feature representation of the data item of the field to be identified reaches a preset similarity with the positive sample data, the identification result is that the data item matches the target field; when the feature representation of the data item of the field to be identified reaches the preset similarity with the negative sample data, the identification result is that the data item does not match the target field.

The training process of the target deep learning model may be:

1) Acquiring a data item matching the target field, generating a feature representation of the data item matching the target field, and determining the feature representation of the data item matching the target field as positive sample data.

In this embodiment, a data item matching the target field is first acquired, a feature representation of that data item is generated, and the feature representation is determined as positive sample data. A data item matching the target field is a data item corresponding to the target field. For example, if the target field is a registration type field, the data items corresponding to the registration type field are internal medicine, surgery, gynecology and the like, and these are used as the data items matching the registration type field; if the target field is a disease field, the data items corresponding to the disease field are Alzheimer's disease, senile dementia, heart disease, asthma and the like, and these are used as the data items matching the disease field.

In a specific implementation, the text features of the data item matching the target field are first extracted, the text features including one or more of character features, inter-character position features, word features and inter-word position features. These features may be extracted by the method described above, which is not repeated here. Next, the matching degree features of the data item matching the target field and each training text set are calculated; formula (1) can be used for the calculation. Finally, the text features of the data item matching the target field and the matching degree features of that data item and each training text set are combined into the feature representation of the data item matching the target field, and this feature representation is the positive sample data.

2) Acquiring a data item that does not match the target field, generating a feature representation of the data item that does not match the target field, and determining the feature representation of the data item that does not match the target field as negative sample data.

In this embodiment, a data item that does not match the target field is first acquired, a feature representation of that data item is generated, and the feature representation is determined as negative sample data. The data items that do not match the target field may be data items corresponding to non-target fields. For example, if the target field is a registration type field, the data items that do not match it are data items corresponding to fields other than the registration type field; if one such other field is a disease field, its data items, such as Alzheimer's disease, senile dementia, heart disease, asthma and the like, are data items that do not match the registration type field.

In a specific implementation, the text features of the data item that does not match the target field are first extracted, the text features including one or more of character features, inter-character position features, word features and inter-word position features. These features may be extracted by the method described above, which is not repeated here. Next, the matching degree features of the data item that does not match the target field and each training text set are calculated; formula (1) can be used for the calculation. Finally, the text features of the data item that does not match the target field and the matching degree features of that data item and each training text set are combined into the feature representation of the data item that does not match the target field, and this feature representation is determined as negative sample data.

3) Training according to the positive sample data and the negative sample data to obtain the target deep learning model corresponding to the target field.

After the positive sample data and the negative sample data are obtained, the positive sample data and the negative sample data are used as training data to train the initial learning model so as to obtain a target deep learning model corresponding to the target field, and the target deep learning model can identify data items similar to the positive sample data and data items similar to the negative sample data.

It should be noted that, to ensure that the trained deep learning model can accurately identify the data items belonging to the same class as the positive sample data, the difference between the amount of positive sample data and the amount of negative sample data needs to be within a preset threshold range, and the negative sample data should be drawn from as many different fields as possible.

Referring to fig. 3, which is a flowchart of another method for obtaining a recognition result according to an embodiment of the present application, the method may include:

S301: when the recognition mode is character matching recognition, acquiring keywords corresponding to the target field.

For some fields, the content that may appear in the data items is fixed, and such fields can be identified by character matching. In this embodiment, when the recognition mode corresponding to the target field is character matching recognition, the keywords corresponding to the target field are obtained, where the keywords represent the data items that may occur in the target field. For example, the keywords corresponding to the medical insurance field include rural cooperative medical care, town medical insurance, business insurance, and the like.

S302: matching the data item of the field to be identified with the keywords corresponding to the target field.

After the keywords corresponding to the target field are determined, each data item of the field to be identified is matched one by one against the keywords corresponding to the target field.

S303: if the data item of the field to be identified matches a keyword corresponding to the target field, acquiring an identification result that the data item of the field to be identified matches the target field.

S304: if the data item of the field to be identified does not match the keywords corresponding to the target field, acquiring an identification result that the data item of the field to be identified does not match the target field.

Each data item of the field to be identified is matched against the keywords corresponding to the target field. If a data item matches one of the keywords, an identification result that the data item matches the target field is obtained; if a data item matches none of the keywords, an identification result that the data item does not match the target field is obtained. That is, each data item of the field to be identified is matched against the keywords of the target field, so as to obtain an identification result of whether each data item of the field to be identified matches the target field.

For example, the field to be identified corresponds to 3 data items, namely data item a, data item b and data item c. If data item a matches a keyword corresponding to the target field, the recognition result is that data item a matches the target field; if data item b matches none of the keywords corresponding to the target field, the recognition result is that data item b does not match the target field; and if data item c matches a keyword corresponding to the target field, the recognition result is that data item c matches the target field.

Therefore, when the recognition mode of the target field is character matching recognition, the recognition result of whether the data item of the field to be recognized is matched with the target field can be obtained through the process.
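Steps S301 to S304 can be sketched as follows. The text does not say whether "matching" means exact equality or something looser; the sketch assumes exact equality after stripping surrounding whitespace, and `recognize_by_keywords` is a hypothetical name.

```python
def recognize_by_keywords(data_items, keywords):
    """Return one boolean per data item: True if it matches a keyword.

    Assumption: a match is exact equality after stripping whitespace.
    """
    keyword_set = {k.strip() for k in keywords}
    return [item.strip() in keyword_set for item in data_items]


# Keywords for the medical insurance field, taken from the text.
insurance_keywords = [
    "rural cooperative medical care",
    "town medical insurance",
    "business insurance",
]
results = recognize_by_keywords(
    ["town medical insurance", "asthma", "business insurance"],
    insurance_keywords,
)
```

Here the three data items play the roles of data items a, b and c in the example above: the first and third match the target field and the second does not.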

Referring to fig. 4, which is a flowchart of another method for obtaining a recognition result according to an embodiment of the present application, the method may include:

S401: when the identification mode is regular rule matching identification, acquiring the regular rule corresponding to the target field.

For some fields, the content that may appear in the data items generally follows specific rules, and such fields can be identified by regular rule matching. In this embodiment, when the identification mode corresponding to the target field is regular rule matching identification, the regular rule corresponding to the target field is obtained, so that the regular rule can be used to determine the identification result of whether the data item of the field to be identified matches the target field.

In a specific implementation, the regular rule corresponding to the target field can be generated according to the characteristics of the data items corresponding to the target field. For example, when the target field is the birth time, the corresponding data item generally takes the form xxxx year xx month xx day, and the regular rule corresponding to the target field may be [xxxx-xx-xx 8], where 8 represents the number of digits included; when the target field is a contact number field, the corresponding data item is usually 1xxxxxxxxxx, and the regular rule corresponding to the target field may be [1xxxxxxxxxx 11], where 11 represents the number of digits.

S402: judging whether the data item of the field to be identified meets the regular rule corresponding to the target field.

And after determining the regular rule corresponding to the target field, judging whether the data item meets the regular rule corresponding to the target field for each data item of the field to be identified, thereby obtaining an identification result of whether each data item is matched with the target field.

Specifically, when the data item of the field to be identified meets the regular rule corresponding to the target field, S403 is executed; and when the data item of the field to be identified does not meet the regular rule corresponding to the target field, executing S404.

S403: acquiring a recognition result that the data item of the field to be recognized matches the target field.

S404: acquiring a recognition result that the data item of the field to be recognized does not match the target field.

That is, when the data item of the field to be identified meets the regular rule corresponding to the target field, the data item is indicated to be matched with the target field, and the identification result of the data item matched with the target field is obtained. When the data item of the field to be identified does not meet the regular rule corresponding to the target field, indicating that the data item is not matched with the target field, and acquiring an identification result of the data item not matched with the target field.

Therefore, when the identification mode corresponding to the target field is regular rule matching identification, the identification result of whether each data item of the field to be identified matches the target field can be determined through the above process.
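Steps S401 to S404 map naturally onto regular expressions. The sketch below uses Python's standard `re` module; the two patterns are one possible encoding of the birth time and contact number rules from the example (8 digits as xxxx-xx-xx, and 11 digits starting with 1), and the rule table and function name are hypothetical.

```python
import re

# Hypothetical rule table: one compiled pattern per target field.
REGULAR_RULES = {
    "birth_time": re.compile(r"\d{4}-\d{2}-\d{2}"),  # xxxx-xx-xx, 8 digits
    "contact_number": re.compile(r"1\d{10}"),        # 1 followed by 10 digits
}


def recognize_by_rule(data_items, target_field):
    """Return one boolean per data item: True if it fully satisfies the rule."""
    rule = REGULAR_RULES[target_field]
    return [rule.fullmatch(item) is not None for item in data_items]


birth_results = recognize_by_rule(["1990-05-17", "17 May 1990"], "birth_time")
phone_results = recognize_by_rule(["13800000000", "2380000000"], "contact_number")
```

`fullmatch` is used so that a data item only satisfies the rule when the whole string has the expected shape, rather than merely containing a fragment that does.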

In addition, in one possible implementation manner, this embodiment provides a way of determining the field to be identified, which specifically includes: determining the target data table in which the target field is located; searching the data tables to be identified for a data table matching the target data table; and determining the fields in the data table matching the target data table as fields to be identified. That is, the target data table where the target field is located is determined, a data table matching the target data table is then searched for in a database comprising various data tables, and each field in the matched data table is determined as a field to be identified.

For example, the target field is a date of birth field, which may be present in a plurality of data tables such as a registration table, an admission registration table, etc. If the date of birth field is a field in the registration table, the registration table is determined to be the target data table. Data tables matching the registration table are then searched for in the database; for example, if a registration table and a patient registration table are found, these two data tables are determined as the data tables matching the target data table, and each field in the registration table and the patient registration table is determined as a field to be identified.

Based on the above method embodiments, the embodiments of the present application further provide an apparatus for identifying a matching field, and the apparatus will be described below with reference to the accompanying drawings.

Referring to fig. 5, the apparatus for identifying a matching field according to an embodiment of the present application may include:

a first determining unit 501, configured to determine an identification manner of the target field;

an obtaining unit 502, configured to identify a data item of a field to be identified by using the identification manner, and obtain an identification result of whether the data item of the field to be identified is matched with the target field;

a second determining unit 503, configured to determine whether the field to be identified matches the target field according to an identification result of whether the data item of the field to be identified matches the target field;

and a third determining unit 504, configured to determine a field to be identified that matches the target field as a matching field of the target field.

In one possible implementation manner, the acquiring unit includes:

the first acquisition subunit is used for acquiring a target deep learning model corresponding to the target field when the recognition mode is recognition by adopting a deep learning model; the target deep learning model corresponding to the target field is obtained by training according to positive sample data and negative sample data, the positive sample data is characteristic representation of a data item matched with the target field, and the negative sample data is characteristic representation of a data item not matched with the target field;

a generation subunit, configured to generate a feature representation of the data item of the field to be identified;

and the second acquisition subunit is used for inputting the characteristic representation of the data item of the field to be identified into a target deep learning model corresponding to the target field, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field.

In one possible implementation, the generating subunit includes:

an extraction subunit, configured to extract text features of the data item of the field to be identified, where the text features include one or more of character features, inter-character position features, word features, and inter-word position features;

the calculating subunit is used for calculating the matching degree characteristics of the data items of the field to be identified and each training text set;

and the composition subunit is used for composing the text characteristics of the data items of the field to be identified and the matching degree characteristics of the data items of the field to be identified and each training text set into the characteristic representation of the data items of the field to be identified.

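In a possible implementation manner, the calculating subunit is specifically configured to obtain a matching value of the data item of the field to be identified and a j-th data item in an i-th training text set, where i and j are positive integers, and each training text set includes data items in the same category; calculate the matching degree value of the data item of the field to be identified and the i-th training text set according to the matching value of the data item of the field to be identified and the j-th data item in the i-th training text set; and determine the matching degree value of the data item of the field to be identified and each training text set as the matching degree feature of the data item of the field to be identified and each training text set.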

In one possible implementation manner, the acquiring unit includes:

a third obtaining subunit, configured to obtain a keyword corresponding to the target field when the recognition mode is character matching recognition;

the matching subunit is used for matching the data item of the field to be identified with the keyword corresponding to the target field;

a fourth obtaining subunit, configured to obtain, if the data item of the field to be identified matches the keyword corresponding to the target field, an identification result that the data item of the field to be identified matches the target field;

and a fifth obtaining subunit, configured to obtain, if the data item of the field to be identified is not matched with the keyword corresponding to the target field, an identification result that the data item of the field to be identified is not matched with the target field.

In one possible implementation manner, the acquiring unit includes:

a sixth obtaining subunit, configured to obtain a regular rule corresponding to the target field when the identification mode is that regular rule matching identification is adopted;

the judging subunit is used for judging whether the data item of the field to be identified meets the regular rule corresponding to the target field;

a seventh obtaining subunit, configured to obtain, when the determination result of the determining subunit is that the data item of the field to be identified meets the regular rule corresponding to the target field, an identification result that the data item of the field to be identified matches with the target field;

and an eighth obtaining subunit, configured to obtain, when the judging result of the judging subunit is that the data item of the field to be identified does not meet the regular rule corresponding to the target field, an identification result that the data item of the field to be identified is not matched with the target field.

In a possible implementation manner, the second determining unit is specifically configured to: among the recognition results of whether multiple randomly selected data items of the field to be identified match the target field, determine that the field to be identified matches the target field if the recognition results indicating a match with the target field outnumber the recognition results indicating no match; and determine that the field to be identified does not match the target field if the recognition results indicating a match are fewer than or equal in number to the recognition results indicating no match.
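The decision rule of the second determining unit amounts to a majority vote over a random sample of data items, which can be sketched as follows. The sample size, the `recognize` callback (standing in for whichever recognition mode applies to the target field) and the function name are assumptions for illustration.

```python
import random


def field_matches_target(data_items, recognize, sample_size=5, rng=None):
    """Decide whether a field matches the target field by majority vote.

    recognize(item) returns True when the item matches the target field.
    A tie, or an empty sample, counts as "does not match".
    """
    if rng is None:
        rng = random.Random()
    items = list(data_items)
    sample = rng.sample(items, min(sample_size, len(items)))
    matched = sum(1 for item in sample if recognize(item))
    unmatched = len(sample) - matched
    return matched > unmatched


def is_known_disease(item):
    # Hypothetical recognizer used only for this illustration.
    return item in {"asthma", "heart disease"}


majority = field_matches_target(["asthma", "heart disease", "n/a"], is_known_disease)
tie = field_matches_target(["asthma", "n/a"], is_known_disease)
```

Because the sample size exceeds the number of items in both calls, every item is sampled: the first call sees two matches against one mismatch and returns True, and the second sees a tie and returns False.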

In one possible implementation, the apparatus further includes:

a fourth determining unit, configured to determine a target data table in which the target field is located;

the searching unit is used for searching a data table matched with the target data table in the data table to be identified;

and a fifth determining unit, configured to determine a field in a data table matched with the target data table as a field to be identified.

It should be noted that, for the implementation of each unit in this embodiment, reference may be made to the above method embodiment; details are not repeated here.

In addition, an embodiment of the application further provides a computer readable storage medium that stores instructions which, when run on a terminal device, cause the terminal device to execute the above method for identifying a matching field.

An embodiment of the application further provides a device for identifying a matching field, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above method for identifying a matching field when executing the computer program. With this method, when a matching field is identified, it is first determined whether the data items of the field to be identified match the target field, and it is then determined from those identification results whether the field to be identified matches the target field. Because the field to be identified and the target field are not represented in a uniform form, they cannot be matched directly; instead, the data items that represent the field to be identified are matched against the target field, which achieves identification of the matching field. In addition, different identification modes are adopted for different target fields, which improves the efficiency of identifying the data items.

It should be noted that the embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be cross-referenced. Because the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

It is further noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of identifying matching fields, the method comprising:

determining the identification mode of the target field;

identifying the data item of the field to be identified by utilizing the identification mode, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field; the recognition mode comprises the steps of adopting a deep learning model recognition or adopting character matching recognition;

determining a field to be identified matched with the target field as a matched field of the target field;

the determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field, includes:

in the recognition results of whether the randomly selected multiple data items in the field to be recognized are matched with the target field, if the recognition results matched with the target field are more than the recognition results not matched with the target field, determining that the field to be recognized is matched with the target field, and if the recognition results matched with the target field are less than or equal to the recognition results not matched with the target field, determining that the field to be recognized is not matched with the target field;

the identifying the data item of the field to be identified by using the identifying mode, and obtaining the identifying result of whether the data item of the field to be identified is matched with the target field, including:

when the recognition mode is recognition by adopting a deep learning model, acquiring a target deep learning model corresponding to the target field; the target deep learning model corresponding to the target field is obtained by training according to positive sample data and negative sample data, the positive sample data is a characteristic representation of a data item matched with the target field, and the negative sample data is a characteristic representation of a data item not matched with the target field; generating a characteristic representation of the data item of the field to be identified; inputting the characteristic representation of the data item of the field to be identified into the target deep learning model corresponding to the target field, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

the generating of the characteristic representation of the data item of the field to be identified comprises:

extracting text features of data items of a field to be identified, wherein the text features comprise one or more of character features, inter-character features, word features and inter-word features; calculating the matching degree characteristics of the data items of the field to be identified and each training text set; the text characteristics of the data items of the field to be identified and the matching degree characteristics of the data items of the field to be identified and each training text set form characteristic representation of the data items of the field to be identified;

the calculating the matching degree characteristics of the data items of the field to be identified and each training text set comprises the following steps:

acquiring a matching value of a data item of the field to be identified and a j-th data item in an i-th training text set, wherein i and j are positive integers, and each training text set comprises data items of the same category; calculating the matching degree value of the data item of the field to be identified and the ith training text set according to the matching value of the data item of the field to be identified and the jth data item in the ith training text set; and determining the matching degree value of the data item of the field to be identified and each training text set as the matching degree characteristic of the data item of the field to be identified and each training text set.
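For illustration only, and not as part of the claim, the feature representation recited in claim 1 can be sketched as follows. The claim does not fix how a pairwise matching value is computed or how the values for one training text set are combined, so the similarity ratio in `matching_value`, the mean in `matching_degree_features`, and the simple character features in `text_features` are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def matching_value(item: str, other: str) -> float:
    # Assumed matching value: a string similarity ratio in [0, 1].
    return SequenceMatcher(None, item, other).ratio()


def matching_degree_features(item: str, training_sets: Sequence[Sequence[str]]) -> List[float]:
    """One matching degree value per training text set (one category per set)."""
    features = []
    for text_set in training_sets:  # i-th training text set
        values = [matching_value(item, other) for other in text_set]  # j-th data item
        # Assumed aggregation: the mean of the pairwise matching values.
        features.append(sum(values) / len(values) if values else 0.0)
    return features


def text_features(item: str) -> List[float]:
    # Hypothetical character features: length, digit ratio, letter ratio.
    n = len(item) or 1
    digits = sum(c.isdigit() for c in item)
    letters = sum(c.isalpha() for c in item)
    return [float(len(item)), digits / n, letters / n]


def feature_representation(item: str, training_sets: Sequence[Sequence[str]]) -> List[float]:
    """Text features followed by the matching degree features."""
    return text_features(item) + matching_degree_features(item, training_sets)
```

The resulting vector would then be the input to the target deep learning model for the target field; the model itself is outside the scope of this sketch.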

2. The method of claim 1, wherein the text characteristics of the data item of the field to be identified include any one or a combination of the following:

3. The method of claim 1, wherein the training process of the target deep learning model corresponding to the target field comprises:

4. The method according to claim 1, wherein the identifying the data item of the field to be identified by using the identifying method, and obtaining an identification result of whether the data item of the field to be identified matches the target field, includes:

5. The method according to claim 1, wherein the method further comprises:

determining a target data table in which a target field is located;

6. An apparatus for identifying matching fields, the apparatus comprising:

the acquisition unit is used for identifying the data item of the field to be identified by utilizing the identification mode and acquiring an identification result of whether the data item of the field to be identified is matched with the target field; the recognition mode comprises the steps of adopting a deep learning model recognition or adopting character matching recognition;

a third determining unit, configured to determine a field to be identified that matches the target field as a matching field of the target field;

the second determining unit is specifically configured to determine, among recognition results of whether the plurality of randomly selected data items in the field to be recognized match the target field, that the field to be recognized matches the target field if the recognition result matching the target field is greater than the recognition result not matching the target field, and determine that the field to be recognized does not match the target field if the recognition result matching the target field is less than or equal to the recognition result not matching the target field;

the acquisition unit includes:

the second acquisition subunit is used for inputting the characteristic representation of the data item of the field to be identified into a target deep learning model corresponding to the target field, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

the generating subunit includes:

the composition subunit is used for composing the text characteristics of the data items of the field to be identified and the matching degree characteristics of the data items of the field to be identified and each training text set into the characteristic representation of the data items of the field to be identified;

the computing subunit is specifically configured to obtain a matching value of the data item of the field to be identified and a j-th data item in an i-th training text set, where i and j are positive integers, and each training text set includes data items of the same category;

7. A computer readable storage medium having instructions stored therein, which when run on a terminal device, cause the terminal device to perform the method of identifying matching fields according to any of claims 1-5.

8. An apparatus for identifying matching fields, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of identifying matching fields as claimed in any one of claims 1 to 5 when the computer program is executed.