CN111104481A - Method, device and equipment for identifying matching field

Info

Publication number: CN111104481A
Application number: CN201911304454.7A
Authority: CN
Inventors: 冯仓龙
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-17
Filing date: 2019-12-17
Publication date: 2020-05-05
Anticipated expiration: 2039-12-17
Also published as: CN111104481B

Abstract

The embodiments of the application disclose a method, a device and equipment for identifying a matching field. Because the field to be identified and the target field are not expressed in a uniform way, the two cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, and the matching field is identified from the result. In addition, different identification modes are used for different target fields, which improves the efficiency of identifying the data items.

Description

Method, device and equipment for identifying matching field

Technical Field

The present application relates to the field of information technology processing, and in particular, to a method, an apparatus, and a device for identifying a matching field.

Background

With the rapid popularization and development of internet technology, a great amount of data is generated in every application field. Within a single field, data are represented differently because different users configure their systems differently; for example, the same thing may be described in several ways, so the data in a database are personalized. In practice, when data of the same type are queried from a database, the required data may not be found because of this personalized expression.

Disclosure of Invention

In view of this, embodiments of the present application provide a method, an apparatus, and a device for identifying a matching field, so as to implement accurate query of the matching field.

In order to solve the above problem, the technical solution provided by the embodiment of the present application is as follows:

a method of identifying matching fields, the method comprising:

determining an identification mode of a target field;

identifying the data item of the field to be identified by utilizing the identification mode, and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field;

and determining the field to be identified which is matched with the target field as a matching field of the target field.

In a possible implementation manner, the identifying, by using the identification manner, the data item of the field to be identified, and obtaining an identification result of whether the data item of the field to be identified matches the target field, includes:

when the identification mode is deep learning model identification, acquiring a target deep learning model corresponding to the target field; the target deep learning model corresponding to the target field is obtained by training on positive sample data and negative sample data, wherein the positive sample data are the feature representations of data items that match the target field, and the negative sample data are the feature representations of data items that do not match the target field;

generating a feature representation of a data item of a field to be identified;

and inputting the feature representation of the data item of the field to be recognized into a target deep learning model corresponding to the target field, and acquiring a recognition result of whether the data item of the field to be recognized is matched with the target field.

In one possible implementation, the generating a feature representation of the data item of the field to be identified includes:

extracting text features of data items of fields to be identified, wherein the text features comprise one or more of character features, inter-character position features, word features and inter-word position features;

calculating the matching degree characteristics of the data items of the fields to be recognized and each training text set;

and combining the text features of the data items of the fields to be recognized and the matching degree features of the data items of the fields to be recognized and each training text set into the feature representation of the data items of the fields to be recognized.

In one possible implementation manner, the text features of the data item of the field to be identified include any one or more of the following combinations:

converting each character of the data item of the field to be identified into a first character feature value according to character features obtained by training on medical data text, and determining the first character feature values of the characters of the data item as the character features of the data item of the field to be identified;

taking each character in the data item of the field to be identified in turn as a first target character, extracting the single character or multiple characters that are adjacent to the first target character and within a preset range of it to form first character groups, converting the first character groups into first character group feature values according to the character features obtained by training on medical data text, determining the first character group feature values as the position feature of the first target character, and determining the position features of the first target characters as the inter-character position features of the data item of the field to be identified;

performing word segmentation on the data item of the field to be identified, converting each segmented word of the data item into a first word feature value according to word features obtained by training on medical data text, and determining the first word feature values as the word features of the data item of the field to be identified;

performing word segmentation on the data item of the field to be identified, taking each segmented word in turn as a first target segmented word, extracting a second target segmented word that is adjacent to the first target segmented word and within a preset range of it, converting the second target segmented word into a second word feature value according to the word features obtained by training on medical data text, determining the second word feature value as the position feature of the first target segmented word, and determining the position features of the first target segmented words as the inter-word position features of the data item of the field to be identified.

In a possible implementation manner, the calculating the matching degree feature of the data item of the field to be recognized and each training text set includes:

acquiring a matching value of the data item of the field to be identified and a jth data item in an ith training text set, wherein i and j are positive integers, and each training text set comprises data items of the same category;

calculating the matching degree value of the data item of the field to be recognized and the ith training text set according to the matching value of the data item of the field to be recognized and the jth data item in the ith training text set;

and determining the matching degree value of the data item of the field to be recognized and each training text set as the matching degree characteristic of the data item of the field to be recognized and each training text set.
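The matching degree calculation above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the matching value between two data items is computed or how the values are aggregated per training text set, so the use of `difflib.SequenceMatcher` and of the mean are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def matching_value(item: str, other: str) -> float:
    """Similarity in [0, 1] between two data items.

    Assumption: SequenceMatcher stands in for the unspecified matching value.
    """
    return SequenceMatcher(None, item, other).ratio()


def matching_degree_features(item: str, training_sets: list[list[str]]) -> list[float]:
    """Return one matching degree value per training text set.

    Assumption: the value for the i-th set is the mean of the matching
    values against each j-th data item in that set.
    """
    features = []
    for text_set in training_sets:  # the i-th training text set
        values = [matching_value(item, other) for other in text_set]  # j-th items
        features.append(mean(values) if values else 0.0)
    return features
```

Each training text set holds data items of one category, so the resulting list has one entry per category and can serve as the matching degree feature of the data item.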

In a possible implementation manner, the training process of the target deep learning model corresponding to the target field includes:

acquiring a data item matched with the target field, generating the characteristic representation of the data item matched with the target field, and determining the characteristic representation of the data item matched with the target field as positive sample data;

acquiring data items which do not match with the target field, generating characteristic representation of the data items which do not match with the target field, and determining the characteristic representation of the data items which do not match with the target field as negative sample data;

and training according to positive sample data and negative sample data to obtain a target deep learning model corresponding to the target field.
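The construction of the training data described above can be sketched as follows. This is a minimal illustration, not the patented training procedure: the model itself is omitted, `build_training_samples` is a hypothetical helper, and `toy_featurize` is a stand-in for the feature representation, which in the source combines text features and matching degree features.

```python
from typing import Callable

FeatureVector = list[float]


def toy_featurize(item: str) -> FeatureVector:
    """Hypothetical stand-in features: length and share of digits."""
    digits = sum(ch.isdigit() for ch in item)
    return [float(len(item)), digits / len(item) if item else 0.0]


def build_training_samples(
    matched_items: list[str],
    unmatched_items: list[str],
    featurize: Callable[[str], FeatureVector],
) -> list[tuple[FeatureVector, int]]:
    """Label feature representations: 1 = positive sample, 0 = negative sample."""
    positives = [(featurize(item), 1) for item in matched_items]
    negatives = [(featurize(item), 0) for item in unmatched_items]
    return positives + negatives
```

The labeled pairs returned here are what a binary classifier for the target field would be trained on; one such model would be trained per target field.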

when the recognition mode is character matching recognition, obtaining keywords corresponding to the target field;

matching the data item of the field to be identified with the keyword corresponding to the target field;

if the data item of the field to be identified is matched with the keyword corresponding to the target field, acquiring an identification result of the data item of the field to be identified matched with the target field;

and if the data item of the field to be identified is not matched with the keyword corresponding to the target field, acquiring an identification result that the data item of the field to be identified is not matched with the target field.
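The character matching steps above can be sketched as follows. The source does not state whether a match means equality or containment, so this sketch assumes a case-insensitive containment test; the function name and the example keywords are hypothetical.

```python
def matches_keywords(data_item: str, keywords: list[str]) -> bool:
    """Return True if the data item contains any keyword of the target field.

    Assumption: matching is case-insensitive substring containment.
    """
    text = data_item.strip().lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical keywords for a medical insurance type field.
INSURANCE_KEYWORDS = ["urban employee", "rural cooperative"]
```

A call such as `matches_keywords("Urban Employee Basic Medical Insurance", INSURANCE_KEYWORDS)` yields the "matched" identification result, and any data item containing none of the keywords yields "not matched".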

when the identification mode is regular rule matching identification, acquiring a regular rule corresponding to the target field;

judging whether the data items of the fields to be identified meet the regular rules corresponding to the target fields;

if the data item of the field to be identified meets the regular rule corresponding to the target field, acquiring an identification result of the data item of the field to be identified matched with the target field;

and if the data item of the field to be identified does not meet the regular rule corresponding to the target field, acquiring an identification result that the data item of the field to be identified is not matched with the target field.
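The regular rule steps above can be sketched with a regular expression. The pattern below is a hypothetical rule for a birth date field in `YYYY-MM-DD` form; the source does not give any concrete rule, and real data may need looser or additional patterns.

```python
import re

# Hypothetical regular rule for a birth date field (assumed YYYY-MM-DD format).
BIRTH_DATE_RULE = re.compile(r"\d{4}-\d{2}-\d{2}")


def satisfies_rule(data_item: str, rule: re.Pattern) -> bool:
    """Return True if the whole data item satisfies the target field's rule."""
    return rule.fullmatch(data_item.strip()) is not None
```

Using `fullmatch` rather than `search` is a design choice of this sketch: a data item that merely contains a date-like substring is not treated as satisfying the rule.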

In a possible implementation manner, the determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field includes:

in the recognition results of whether the randomly selected plurality of data items in the field to be recognized are matched with the target field, if the recognition results matched with the target field are more than the recognition results not matched with the target field, the field to be recognized is determined to be matched with the target field, and if the recognition results matched with the target field are less than or equal to the recognition results not matched with the target field, the field to be recognized is determined not to be matched with the target field.

In one possible implementation, the method further includes:

determining a target data table where the target field is located;

searching a data table matched with the target data table in a data table to be identified;

and determining the fields in the data table matched with the target data table as the fields to be identified.

An apparatus to identify matching fields, the apparatus comprising:

the first determining unit is used for determining the identification mode of the target field;

the acquisition unit is used for identifying the data item of the field to be identified by utilizing the identification mode and acquiring an identification result of whether the data item of the field to be identified is matched with the target field;

a second determining unit, configured to determine whether the field to be identified matches the target field according to an identification result indicating whether the data item of the field to be identified matches the target field;

and the third determining unit is used for determining the field to be identified which is matched with the target field as the matching field of the target field.

A computer readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of identifying matching fields.

An apparatus for identifying matching fields, comprising: the device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the method for identifying the matching field.

Therefore, the embodiment of the application has the following beneficial effects:

in this embodiment, an identification manner of the target field is first determined, and then the data item of the field to be identified is identified by using the identification manner, so as to obtain an identification result of whether the data item of the field to be identified is matched with the target field. And then, determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field, so as to determine the field to be identified matched with the target field as the matching field of the target field.

That is, when identifying the matching field, the embodiment of the application first determines whether the data items corresponding to the field to be identified match the target field, and then determines from those identification results whether the field to be identified matches the target field. Because the field to be identified and the target field are not expressed in a uniform way, the two cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, and the matching field is identified from the result. In addition, different identification modes are used for different target fields, which improves the efficiency of identifying the data items.

Drawings

Fig. 1 is a flowchart of a method for identifying matching fields according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a method for obtaining an identification result according to an embodiment of the present disclosure;

Fig. 3 is a flowchart of another method for obtaining recognition results according to an embodiment of the present disclosure;

Fig. 4 is a flowchart of another method for obtaining recognition results according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of an apparatus for identifying a matching field according to an embodiment of the present disclosure.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

In order to facilitate understanding of the technical solutions provided by the embodiments of the present application, a description will be given of a background technology related to the embodiments of the present application.

Due to different configurations of different users in the same field, the produced data are different in expression form, for example, multiple description modes exist for the same thing, so that the data in the database are personalized. In practical application, when the same type of data is searched from a database, the required data cannot be found due to personalized expression of the data.

Specifically, in the medical field, different health institutions use a Hospital Information System (HIS) to manage their own medical data. However, HIS systems from different manufacturers use different configuration formats, so the format of the resulting medical data varies widely. When a health information platform extracts the required medical data from the data tables of each institution's HIS system, the fields of the data tables in the health information platform are standard fields, whereas the fields of the data tables in the various HIS systems are non-standard fields that differ greatly from one another, so the required fields cannot be found quickly in the HIS data tables.

Based on this, an embodiment of the present application provides a method for identifying a matching field, which includes determining a target field and an identification manner corresponding to the target field, and identifying a data item of a field to be identified by using the identification manner to obtain an identification result of whether the data item of the field to be identified matches the target field. Namely, the data item which can characterize the field to be identified is matched with the target field to obtain the identification result. And then, determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field, so as to determine the field to be identified matched with the target field as the matching field of the target field. It can be seen that although the target field is not uniform with the field to be identified, the data item corresponding to the field to be identified may be used to match with the target field, so as to determine the matching field corresponding to the target field.

A field can be understood as information that characterizes a type of service data in a data table, such as a disease field, a medical insurance field, or an order field. The data items of a field are the specific information filled in in the HIS system; for example, the data items corresponding to the disease field are chronic bronchitis, Alzheimer's disease, asthma, and the like.

In order to facilitate understanding of technical solutions provided by the embodiments of the present application, a method for identifying a matching field provided by the embodiments of the present application will be described below with reference to the accompanying drawings.

Referring to fig. 1, which is a flowchart of a method for identifying a matching field according to an embodiment of the present application, as shown in fig. 1, the method may include:

S101: Determine the identification mode of the target field.

In this embodiment, the target field may be understood as a field that needs to be field-matched, and in practical applications, the target field may be any standard field in a data table of a health information platform. And when the target field is determined, determining an identification mode corresponding to the target field so as to identify the data item corresponding to the field to be identified subsequently by using the identification mode.

It can be understood that, because different fields have different characteristics, different identification modes can be set for different fields in advance so that the characteristics of each field are fully used when its data items are subsequently identified. Specifically, the deep learning model identification mode can be set for highly specialized fields, such as disease fields, medical institution fields, and medicine fields; the character matching identification mode can be set for fields whose data item content is fixed, such as medical insurance type fields and registration type fields; and the regular rule (regular expression) identification mode can be set for fields whose data items follow specific rules, for example numeric content, such as contact information fields, birth date fields, and surgery time fields. Specific implementations of each identification mode are described in the following embodiments.
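The pre-set assignment of identification modes to fields can be sketched as a lookup table. The field names come from the examples in the paragraph above; the enum, the dictionary, and the function name are hypothetical, and a real deployment would configure this mapping per platform.

```python
from enum import Enum


class Mode(Enum):
    DEEP_LEARNING = "deep learning model identification"
    CHARACTER_MATCHING = "character matching identification"
    REGULAR_RULE = "regular rule identification"


# Assumed configuration: one identification mode per standard target field.
MODE_BY_TARGET_FIELD = {
    "disease": Mode.DEEP_LEARNING,
    "medical institution": Mode.DEEP_LEARNING,
    "medicine": Mode.DEEP_LEARNING,
    "medical insurance type": Mode.CHARACTER_MATCHING,
    "registration type": Mode.CHARACTER_MATCHING,
    "contact information": Mode.REGULAR_RULE,
    "birth date": Mode.REGULAR_RULE,
    "surgery time": Mode.REGULAR_RULE,
}


def identification_mode(target_field: str) -> Mode:
    """Look up the identification mode configured for a target field."""
    return MODE_BY_TARGET_FIELD[target_field]
```

Keeping the mapping as data rather than branching logic means a new target field only needs a new table entry.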

S102: Identify the data item of the field to be identified by using the identification mode, and obtain an identification result indicating whether the data item of the field to be identified matches the target field.

S103: Determine, according to the identification result indicating whether the data item of the field to be identified matches the target field, whether the field to be identified matches the target field.

And after the identification mode corresponding to the target field is determined, identifying the data item of the field to be identified by using the identification mode so as to obtain an identification result of whether the data item of the field to be identified is matched with the target field. And then, determining whether the field to be identified is matched with the target field according to the identification result of whether the data item of the field to be identified is matched with the target field.

It can be understood that, since the data item of the field to be identified can characterize the attribute of the field to be identified, whether the field to be identified and the target field match can be determined according to the identification result of whether the data item of the field to be identified and the target field match.

It should be noted that, the field to be identified may be each field in the data table to be identified, and in practical applications, the data table to be identified may be a data table in the HIS system. The data item of the field to be identified may be plural, for example, the data item of the field to be identified is asthma, bronchitis, heart disease, etc. When the data items of the field to be identified are identified by using the identification mode corresponding to the target field, each data item of the field to be identified can be identified, or a plurality of data items can be selected from the data items for identification, so as to obtain an identification result whether each data item is matched with the target field. And then, determining whether the field to be identified is matched with the target field according to the identification result of whether each data item corresponding to the field to be identified is matched with the target field.

In a specific implementation, a voting scheme may be used to determine whether the field to be identified matches the target field from the identification results of its data items. Specifically, among the identification results for a plurality of data items randomly selected from the field to be identified, if the results that match the target field outnumber the results that do not match, the field to be identified is determined to match the target field; if the matching results are fewer than or equal in number to the non-matching results, the field to be identified is determined not to match the target field. That is, a plurality of data items may be randomly selected from the data items corresponding to the field to be identified, and whether the field matches the target field is decided by majority rule over the identification results of the selected data items.
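The voting scheme above can be sketched as follows. The function name, the `sample_size` default, and the `identify` callback (standing in for whichever identification mode applies to the target field) are assumptions of this sketch.

```python
import random
from typing import Callable, Optional


def field_matches_by_vote(
    data_items: list[str],
    identify: Callable[[str], bool],
    sample_size: int = 5,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide by majority vote whether a field matches the target field.

    A tie counts as "not matched", as described in the text.
    """
    rng = rng or random.Random()
    k = min(sample_size, len(data_items))
    sample = rng.sample(data_items, k)  # randomly selected data items
    matched = sum(1 for item in sample if identify(item))
    unmatched = len(sample) - matched
    return matched > unmatched
```

Sampling keeps the cost bounded for fields with many data items; passing a seeded `random.Random` makes a run reproducible.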

A specific implementation of identifying the data item of the field to be identified by using the identification manner to obtain an identification result indicating whether the data item of the field to be identified matches the target field will be described in the following embodiments.

S104: Determine the field to be identified that matches the target field as the matching field of the target field.

When a field to be identified that matches the target field is determined from the identification results of whether its data items match the target field, that field is determined as the matching field of the target field. Specifically, when every data item of the field to be identified matches the target field, or when a preset number of data items of the field to be identified match the target field, the field to be identified is determined to match the target field and is a matching field of the target field.
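The selection of matching fields from a data table to be identified can be sketched as follows. This is an illustration under assumptions: the table is modeled as a dictionary from field name to data items, `identify` stands in for the identification mode of the target field, and `min_matches` models the preset number of matching data items (when omitted, every data item must match). All names are hypothetical.

```python
from typing import Callable, Optional


def find_matching_fields(
    table: dict[str, list[str]],
    identify: Callable[[str], bool],
    min_matches: Optional[int] = None,
) -> list[str]:
    """Return the fields of the table that match the target field."""
    matching = []
    for field_name, items in table.items():
        hits = sum(1 for item in items if identify(item))
        # Default: every data item must match; otherwise use the preset count.
        required = len(items) if min_matches is None else min_matches
        if items and hits >= required:
            matching.append(field_name)
    return matching
```

For a disease target field, a column holding diagnoses would be returned while a column holding dates would not, regardless of how either column is named in the HIS table.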

Based on the above description, when identifying the matching field, it is first determined whether the data items corresponding to the field to be identified match the target field, and it is then determined from those identification results whether the field to be identified matches the target field. Because the field to be identified and the target field are not expressed in a uniform way, the two cannot be matched directly; instead, the data items that characterize the field to be identified are matched against the target field, and the matching field is identified from the result. In addition, different identification modes are used for different target fields, which improves the efficiency of identifying the data items.

As can be seen from the foregoing embodiments, the embodiments of the present application provide three ways of obtaining an identification result, and in order to better understand the implementation process of each identification way, the following description will be separately provided with reference to the accompanying drawings.

Referring to fig. 2, which is a flowchart of a method for obtaining an identification result according to an embodiment of the present application, as shown in fig. 2, the method may include:

S201: When the identification mode is deep learning model identification, acquire the target deep learning model corresponding to the target field.

For some highly specialized fields, the data items may have many possible contents, and the identification mode of such fields may be deep learning model identification. When the identification mode of the target field is determined to be deep learning model identification, the target deep learning model corresponding to the target field is acquired so that the data items of the field to be identified can be identified with it. Among the fields identified by deep learning models, each field may correspond to its own deep learning model, so the target deep learning model corresponding to the target field must be acquired first. The target deep learning model is obtained by training on positive sample data and negative sample data, where the positive sample data are the feature representations of data items that match the target field and the negative sample data are the feature representations of data items that do not match the target field. That is, the target deep learning model can identify whether an input data item matches the target field.

S202: Generate a feature representation of the data item of the field to be identified.

It can be understood that, because the target deep learning model is trained on feature representations of data items, the feature representation of a data item of the field to be identified must be obtained first when that data item is identified with the target deep learning model; the feature representation is then input into the target deep learning model to obtain the identification result.

In a specific implementation, the feature representation of the data item of the field to be identified may be generated in the following manner, specifically:

1) Extract the text features of the data item of the field to be identified.

That is, for each data item of the field to be identified, the text features of the data item are extracted. The text features comprise one or more of character features, inter-character position features, word features and inter-word position features.

Specifically, when extracting text features of the data items of the field to be identified, any one or more of the following combinations may be extracted:

11) Convert each character of the data item of the field to be identified into a first character feature value according to the character features obtained by training on medical data text, and determine the first character feature values of the characters of the data item as the character features of the data item of the field to be identified.

That is, each character in the data item of the field to be identified is converted into a first character feature value according to the character features obtained by training on medical data text, and these first character feature values together form the character features of the data item. In this embodiment, a deep learning method can be used to train on medical data text (such as common medical terms, institution names, drug names, medical insurance names, and disease names) to obtain the character feature of each character in the medical field. In addition, the medical data text can be used for training to obtain the word feature of each segmented word in the medical field. The character features and the word features may be expressed as feature vectors.
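The conversion of characters into feature values can be sketched as a table lookup. In this illustration the table of pre-trained character feature vectors is passed in as a plain dictionary with made-up values; how the vectors are trained, their dimension, and the handling of characters missing from the table are not specified in the source and are assumptions here.

```python
FeatureVector = list[float]


def character_features(
    data_item: str,
    char_table: dict[str, FeatureVector],
    unknown: FeatureVector,
) -> list[FeatureVector]:
    """Map each character of a data item to its feature vector.

    Assumption: characters absent from the table map to a fixed
    `unknown` vector.
    """
    return [char_table.get(ch, unknown) for ch in data_item]


# Hypothetical two-dimensional character features, for illustration only.
TOY_TABLE = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
```

Word features for segmented words could be looked up the same way, with a table keyed by word instead of by character.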

12) Extracting the single character or multiple characters that are adjacent to a first target character and within a preset range of the first target character to form a first character group, converting the first character group into a first character group feature value according to character features obtained by medical data text training, determining the first character group feature value as the position feature of the first target character, and determining the position features of all first target characters as the inter-character position feature of the data item of the field to be recognized.

Each character in the data item of the field to be recognized is taken in turn as a first target character, and a first character group consisting of the single character or multiple characters adjacent to the first target character and within a preset range of it is extracted. The first character group is converted into a first character group feature value according to character features obtained by medical data text training, and the first character group feature value is determined as the position feature of the first target character. After the position feature of each first target character of the data item is obtained, these position features are determined as the inter-character position feature of the data item of the field to be recognized. The preset range is a window that controls how many characters are extracted, and the window can be set according to the actual application. For example, when the window is 1, 1 character is extracted before and 1 character after the position of the first target character, and the extracted characters are used as 2 first character groups; when the window is 2, 2 characters are extracted before and 2 characters after the position of the first target character, and the extracted characters are used as 2 first character groups.

For example, suppose the data item corresponding to the field to be recognized is "senile dementia" and the extraction window is 2. When the first character of the data item is the first target character, the forward extraction is empty, which can be marked with a preset empty symbol such as "-", and the backward extraction yields the two characters that follow it; "-" and the two extracted characters are respectively used as first character groups, each first character group is converted into a first character group feature value, and the two first character group feature values are determined as the position feature of that character. When a later character is the first target character, the two characters before it and the two characters after it are respectively used as first character groups, each first character group is converted into a first character group feature value, and the feature values are determined as the position feature of that target character. After the position feature of each character in the data item "senile dementia" is obtained, these position features are determined as the inter-character position feature of "senile dementia".
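The window extraction described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `char_context_groups` and the sample string "ABCDE" are hypothetical, and the sketch assumes that near the edges of the data item a shorter group is kept as-is while a fully empty group is replaced by the "-" placeholder. Converting each group into a feature value needs the trained character features and is left out.

```python
from typing import List, Tuple

EMPTY = "-"  # preset placeholder marking an empty extraction


def char_context_groups(item: str, window: int) -> List[Tuple[str, str, str]]:
    """Return (target character, forward group, backward group) for every character."""
    groups = []
    for pos, char in enumerate(item):
        # Up to `window` characters immediately before the target character.
        forward = item[max(0, pos - window):pos]
        # Up to `window` characters immediately after the target character.
        backward = item[pos + 1:pos + 1 + window]
        # Assumption: an empty group is replaced by the placeholder symbol.
        groups.append((char, forward or EMPTY, backward or EMPTY))
    return groups


for char, forward, backward in char_context_groups("ABCDE", window=2):
    print(char, forward, backward)
```

With a window of 2, the first character gets an empty forward group and a two-character backward group, which mirrors the first case of the example above.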

13) Segmenting the data item of the field to be recognized into words, converting each segmented word of the data item into a first word feature value according to word features obtained by medical data text training, and determining each first word feature value as a word feature of the data item of the field to be recognized.

In this embodiment, word segmentation is performed on the data item of the field to be recognized to obtain its segmented words, each segmented word is converted into a first word feature value according to word features obtained by medical data text training, and each first word feature value is determined as a word feature of the data item of the field to be recognized. The word segmentation of the data item may be implemented by a conventional word segmentation method, which is not described again here.

14) Segmenting the data item of the field to be recognized into words, extracting second target segmented words that are adjacent to a first target segmented word and within a preset range of the first target segmented word, converting each second target segmented word into a second word feature value according to word features obtained by medical data text training, determining the second word feature values as the position feature of the first target segmented word, and determining the position features of all first target segmented words as the inter-word position feature of the data item of the field to be recognized.

In this embodiment, word segmentation is performed on the data item of the field to be recognized to obtain its segmented words, each segmented word is taken in turn as a first target segmented word, and the second target segmented words that are adjacent to the first target segmented word and within a preset range of it are extracted. Each second target segmented word is then converted into a second word feature value according to the word features obtained by medical data text training, the second word feature values are determined as the position feature of the first target segmented word, and finally the position features of all first target segmented words are determined as the inter-word position feature of the data item of the field to be recognized.

The preset range of the first target segmented word refers to a window for extracting adjacent segmented words around the first target segmented word, and the size of the window can be set according to the actual application. For example, when the window is 1, 1 second target segmented word is extracted before and 1 after the position of the first target segmented word, the two extracted second target segmented words are converted into second word feature values, and the second word feature values are determined as the position feature of the first target segmented word; when the window is 2, 2 second target segmented words are extracted before and 2 after the position of the first target segmented word, the 4 extracted second target segmented words are respectively converted into second word feature values, and the second word feature values are determined as the position feature of the first target segmented word.

For example, the data item of the field to be recognized is "senile dementia", the word segmentation results are "senile", "dementia" and "symptom", and the extraction window is 1. When "senile" is the first target segmented word, the forward extraction is empty and is marked "-", and the backward extraction yields "dementia"; "-" and "dementia" are used as the second target segmented words, they are respectively converted into second word feature values, and the two second word feature values are determined as the position feature of the first target segmented word "senile". When "dementia" is the first target segmented word, the second target segmented word extracted forward is "senile" and the one extracted backward is "symptom"; "senile" and "symptom" are respectively converted into second word feature values, which are determined as the position feature of "dementia". Similarly, when the first target segmented word is "symptom", the second target segmented word extracted forward is "dementia" and the one extracted backward is empty, marked "-"; "dementia" and "-" are respectively converted into second word feature values, which are used as the position feature of "symptom". After the position feature of each segmented word in the data item "senile dementia" is obtained, these position features are determined as the inter-word position feature of "senile dementia".
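The word-level window from this example can be sketched as follows. It is a minimal illustration under assumptions: the function name `segment_context` is hypothetical, the segmented words are supplied by the caller (the embodiment only says a conventional segmentation method is used), and every missing neighbour is padded with "-", which the example above confirms only for a window of 1. The conversion into second word feature values is left out.

```python
from typing import List, Sequence, Tuple

EMPTY = "-"  # preset placeholder marking a missing neighbour


def segment_context(tokens: Sequence[str], window: int) -> List[Tuple[str, List[str]]]:
    """Return (first target segmented word, neighbouring segmented words) pairs."""
    result = []
    for pos, token in enumerate(tokens):
        # Neighbours before the target, nearest one last; pad when out of range.
        forward = [tokens[pos - k] if pos - k >= 0 else EMPTY
                   for k in range(window, 0, -1)]
        # Neighbours after the target, nearest one first; pad when out of range.
        backward = [tokens[pos + k] if pos + k < len(tokens) else EMPTY
                    for k in range(1, window + 1)]
        result.append((token, forward + backward))
    return result


for token, neighbours in segment_context(["senile", "dementia", "symptom"], window=1):
    print(token, neighbours)
```

For the three segmented words of the example and a window of 1, the neighbours are ("-", "dementia"), ("senile", "symptom") and ("dementia", "-") respectively.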

Through the above description, the various text features of the data item of the field to be recognized, that is, the character feature, the inter-character position feature, the word feature, and the inter-word position feature, can be obtained. Each of these may itself comprise a plurality of features. For example, a data item may correspond to a plurality of segmented words, in which case the word feature includes a word feature for each segmented word; similarly, the character feature may include a character feature for each character, the inter-character position feature may include a position feature for each character, and so on.

2) Calculating the matching degree feature between the data item of the field to be recognized and each training text set.

In this embodiment, the matching degree feature between the data item of the field to be recognized and each training text set, that is, the feature describing how strongly the data item correlates with each training text set, may also be calculated. A training text set is the set of data items corresponding to one field; each field has its own training text set. For example, the training text set corresponding to a drug field is [eszolam, benzathine penicillin, long-acting penicillin], and the training text set corresponding to a medical insurance field is [rural cooperative medical treatment, town medical insurance, business insurance].

Specifically, the matching degree feature between the data item of the field to be recognized and a training text set can be calculated as follows:

21) Acquiring a matching value between the data item of the field to be recognized and the jth data item in the ith training text set, where i and j are positive integers and each training text set includes data items of one category.

A matching value is calculated between each data item of the field to be recognized and each data item in the training text set, so that the matching value between every such pair of data items is obtained.

22) Calculating the matching degree value between the data item of the field to be recognized and the ith training text set according to the matching values between the data item of the field to be recognized and the jth data item in the ith training text set.

After the matching value between the data item of the field to be recognized and each data item in a given training text set is obtained, these matching values are used to calculate the matching degree value between the data item and that training text set. When a plurality of training text sets exist, the matching degree value between each data item of the field to be recognized and each training text set is calculated. For example, if the field to be recognized corresponds to 3 data items and there are 20 training text sets, the matching degree value between each data item and each of the 20 training text sets is calculated, giving 60 matching degree values in total.

Specifically, the following formula can be used for calculation:

where q_i represents the matching degree value between the data item of the field to be recognized and the ith training text set, u_i represents the correlation coefficient corresponding to the ith training text set, w_ij represents the matching value between the data item of the field to be recognized and the jth data item in the ith training text set, and N indicates that the ith training text set includes N data items.

23) Determining the matching degree value between the data item of the field to be recognized and each training text set as the matching degree feature between the data item of the field to be recognized and that training text set.

After the matching degree values between the data item of the field to be recognized and each training text set are obtained, the matching degree value between the data item and a given training text set is determined as the matching degree feature between the data item and that training text set.
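Steps 21) to 23) can be sketched as follows. The formula that combines u_i and w_ij into q_i is not reproduced in this text, so the sketch takes the aggregation as a parameter. The default `scaled_mean` (u_i times the mean of the w_ij) and the exact-equality `exact_match_value` are assumptions made purely for illustration and should not be read as the formula or the matching value of the embodiment; all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def exact_match_value(item: str, reference: str) -> float:
    """Hypothetical matching value w_ij: 1.0 for identical strings, else 0.0."""
    return 1.0 if item == reference else 0.0


def scaled_mean(u_i: float, w_i: Sequence[float]) -> float:
    """Assumed aggregation (not the embodiment's formula): u_i * mean(w_ij)."""
    return u_i * sum(w_i) / len(w_i) if w_i else 0.0


def matching_degree_features(
    item: str,
    training_sets: Sequence[Sequence[str]],
    coefficients: Sequence[float],
    match_value: Callable[[str, str], float] = exact_match_value,
    aggregate: Callable[[float, Sequence[float]], float] = scaled_mean,
) -> List[float]:
    """Return one matching degree value q_i per training text set."""
    features = []
    for texts, u_i in zip(training_sets, coefficients):
        # Step 21: matching value against every data item of the i-th set.
        w_i = [match_value(item, reference) for reference in texts]
        # Steps 22 and 23: aggregate into q_i and keep it as a feature.
        features.append(aggregate(u_i, w_i))
    return features


drug_set = ["eszolam", "benzathine penicillin", "long-acting penicillin"]
insurance_set = ["rural cooperative medical treatment",
                 "town medical insurance", "business insurance"]
print(matching_degree_features("business insurance",
                               [drug_set, insurance_set], [1.0, 1.0]))
```

A different matching value or aggregation can be supplied through the `match_value` and `aggregate` parameters without changing the rest of the sketch.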

3) Combining the text features of the data item of the field to be recognized and the matching degree features between the data item of the field to be recognized and each training text set into the feature representation of the data item of the field to be recognized.

After each text feature of the data item of the field to be recognized and the matching degree feature between the data item and each training text set are obtained, all the obtained text features and matching degree features together form the feature representation of the data item of the field to be recognized.

S203: inputting the feature representation of the data item of the field to be recognized into the target deep learning model corresponding to the target field, and obtaining a recognition result of whether the data item of the field to be recognized matches the target field.

After the feature representation of the data item of the field to be recognized is obtained, it is input into the target deep learning model, which recognizes the feature representation and outputs a recognition result of whether the corresponding data item matches the target field. Specifically, when the feature representation of the data item of the field to be recognized reaches a preset similarity with the positive sample data, a recognition result that the data item of the field to be recognized matches the target field is obtained; when the feature representation reaches the preset similarity with the negative sample data, a recognition result that the data item of the field to be recognized does not match the target field is obtained.

The training process of the target deep learning model can be as follows:

1) acquiring a data item matched with the target field, generating a characteristic representation of the data item matched with the target field, and determining the characteristic representation of the data item matched with the target field as positive sample data.

In this embodiment, first, a data item matching a target field is acquired, a feature representation of the data item matching the target field is generated, and the feature representation is determined as positive sample data. The data items matched with the target field are the data items corresponding to the target field, for example, if the target field is a registration type field, and the data items corresponding to the registration type field are internal medicine, surgery, gynecology and the like, the data items are used as the data items matched with the registration type field; and if the target field is a disease field, the data item corresponding to the disease field is Alzheimer disease, senile dementia, heart disease, asthma and the like, and the data item is used as the data item matched with the disease field.

In specific implementation, first, text features of the data items matching the target field are extracted, where the text features include one or more of character features, inter-character position features, word features, and inter-word position features. These features may be extracted by the method described above, which is not described again here. Second, the matching degree features between the data items matching the target field and each training text set are calculated, for which formula (1) can be used. Finally, the text features of the data items matching the target field and the matching degree features between those data items and each training text set are combined into the feature representation of the data items matching the target field, and this feature representation is the positive sample data.

2) Acquiring data items which do not match the target field, generating characteristic representation of the data items which do not match the target field, and determining the characteristic representation of the data items which do not match the target field as negative sample data.

In this embodiment, first, a data item that does not match the target field is acquired, a feature representation of the data item that does not match the target field is generated, and the feature representation is determined as negative sample data. The data items that do not match the target field may be data items corresponding to non-target fields, for example, if the target field is a registration category field, then the data items that do not match the registration category field are data items corresponding to fields other than the registration category field, and if the other fields are disease fields, for example, the data items corresponding to the disease fields are alzheimer's disease, senile dementia, heart disease, asthma, etc., then the data items are regarded as data items that do not match the registration category field.

In specific implementation, firstly, text features of the data items which do not match with the target field are extracted, wherein the text features comprise one or more of character features, inter-character position features, word features and inter-word position features. The extraction of the character features, the inter-character position features, the word features, and the inter-word position features may be implemented by the above method, and this embodiment is not described herein again. Secondly, calculating the matching degree characteristics of the data items which are not matched with the target field and each training text set, wherein the specific calculation process can utilize formula (1). And finally, combining the text characteristics of the data items which do not match with the target field, the matching degree characteristics of the data items which do not match with the target field and each training text set into a characteristic representation of the data item which does not match with the target field, and determining the characteristic representation as negative sample data.

3) Training according to the positive sample data and the negative sample data to obtain the target deep learning model corresponding to the target field.

After the positive sample data and the negative sample data are obtained, an initial learning model is trained with the positive sample data and the negative sample data as training data to obtain the target deep learning model corresponding to the target field, so that the target deep learning model can identify data items similar to the positive sample data and data items similar to the negative sample data.

It should be noted that, to ensure that the trained deep learning model can accurately identify data items of the same category as the positive sample data, the difference between the amount of positive sample data and the amount of negative sample data needs to be within a preset threshold range, and the fields from which the negative sample data are drawn should be as varied as possible.
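The sample-balance condition can be sketched as a check performed while the training data are assembled. This is an illustration only: the function name `build_training_set`, the parameter `max_size_gap` (standing in for the preset threshold) and the 1/0 labels are hypothetical, and the model training itself is omitted because the text does not specify a model architecture.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], int]


def build_training_set(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
    max_size_gap: int,
) -> List[Sample]:
    """Label feature representations and enforce the sample-size balance."""
    gap = abs(len(positive) - len(negative))
    if gap > max_size_gap:
        # The difference in sample counts exceeds the preset threshold.
        raise ValueError(f"sample counts differ by {gap}, limit is {max_size_gap}")
    # Label 1: data items matching the target field; label 0: not matching.
    samples = [(list(features), 1) for features in positive]
    samples += [(list(features), 0) for features in negative]
    return samples


training_data = build_training_set(
    positive=[[0.1, 0.2]],
    negative=[[0.9, 0.8], [0.7, 0.6]],
    max_size_gap=1,
)
print(len(training_data))
```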

Referring to fig. 3, which is a flowchart of another method for obtaining an identification result according to an embodiment of the present application, as shown in fig. 3, the method may include:

S301: when the recognition mode is character matching recognition, acquiring keywords corresponding to the target field.

For some fields, the content of the data item may be fixed, and the identification mode of the fields may be character matching identification. In this embodiment, when the identification mode corresponding to the target field is character matching identification, a keyword corresponding to the target field is obtained, and the keyword represents a data item that may appear in the target field. For example, the keywords corresponding to the medical insurance field include rural cooperative medical care, town medical insurance, business insurance, and the like.

S302: and matching the data item of the field to be identified with the keyword corresponding to the target field.

And after determining the keywords corresponding to the target field, for each data item of the field to be identified, matching each data item with the keywords corresponding to the target field one by one.

S303: and if the data item of the field to be identified is matched with the keyword corresponding to the target field, acquiring an identification result of the data item of the field to be identified matched with the target field.

S304: and if the data item of the field to be identified is not matched with the keyword corresponding to the target field, acquiring an identification result that the data item of the field to be identified is not matched with the target field.

Each data item of the field to be identified is matched against the keywords corresponding to the target field. If a data item of the field to be identified matches one of the keywords corresponding to the target field, an identification result that the data item matches the target field is acquired. If a data item of the field to be identified matches none of the keywords corresponding to the target field, an identification result that the data item does not match the target field is acquired. That is, every data item of the field to be identified is matched against the keywords of the target field, yielding for each data item an identification result of whether it matches the target field.

For example, the field to be identified corresponds to 3 data items, namely data item a, data item b and data item c. If data item a matches a keyword corresponding to the target field, the identification result is that data item a matches the target field; if data item b does not match any keyword corresponding to the target field, the identification result is that data item b does not match the target field; and if data item c matches a keyword corresponding to the target field, the identification result is that data item c matches the target field.

Therefore, when the identification mode of the target field is character matching identification, the identification result of whether the data item of the field to be identified is matched with the target field can be obtained through the process.
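Steps S301 to S304 can be sketched as follows. The function name `match_by_keywords` is hypothetical, and the sketch assumes that a data item "matches" a keyword when the two strings are identical; the text does not state whether partial matches also count.

```python
from typing import Dict, Iterable


def match_by_keywords(data_items: Iterable[str],
                      keywords: Iterable[str]) -> Dict[str, bool]:
    """Map each data item to whether it matches any keyword of the target field."""
    keyword_set = set(keywords)  # S301: keywords of the target field
    # S302 to S304: one identification result per data item.
    return {item: item in keyword_set for item in data_items}


insurance_keywords = ["rural cooperative medical care",
                      "town medical insurance", "business insurance"]
results = match_by_keywords(
    ["town medical insurance", "asthma", "business insurance"],
    insurance_keywords,
)
print(results)
```

As in the example with data items a, b and c, the first and third items match the target field and the second does not.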

Referring to fig. 4, which is a flowchart of another method for obtaining an identification result according to an embodiment of the present application, as shown in fig. 4, the method may include:

S401: when the identification mode is regular rule matching identification, acquiring the regular rule corresponding to the target field.

For some fields, the content of the data items generally follows specific rules, and such fields can be identified by regular rule matching. In this embodiment, when the identification mode corresponding to the target field is regular rule matching identification, the regular rule corresponding to the target field is obtained, and the regular rule is used to judge whether the data item of the field to be identified matches the target field.

In specific implementation, the regular rule corresponding to the target field can be generated according to the characteristics of the data items corresponding to the target field. For example, when the target field is time of birth, its data items usually take the form xxxx year xx month xx day, and the regular rule corresponding to the target field may be [xxxx-xx-xx 8], where 8 is the number of digits; when the target field is a contact field, its data items usually take the form 1xxxxxxxxxx, and the regular rule corresponding to the target field may be [1xxxxxxxxxx 11], where 11 is the number of digits.

S402: and judging whether the data item of the field to be identified meets the regular rule corresponding to the target field.

And after the regular rule corresponding to the target field is determined, judging whether each data item of the field to be identified meets the regular rule corresponding to the target field, so as to obtain an identification result whether each data item is matched with the target field.

Specifically, when the data item of the field to be identified meets the regular rule corresponding to the target field, executing S403; and executing S404 when the data item of the field to be identified does not meet the regular rule corresponding to the target field.

S403: and acquiring a recognition result of the matching of the data item of the field to be recognized and the target field.

S404: and acquiring an identification result that the data item of the field to be identified is not matched with the target field.

That is, when the data item of the field to be identified meets the regular rule corresponding to the target field, indicating that the data item is matched with the target field, and acquiring the identification result of the data item matched with the target field. And when the data item of the field to be identified does not meet the regular rule corresponding to the target field, indicating that the data item is not matched with the target field, and acquiring an identification result that the data item is not matched with the target field.

It can be seen that, when the identification mode corresponding to the target field is the matching identification by the regular rule, the identification result of whether each data item of the field to be identified is matched with the target field can be determined through the above process.
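Steps S401 to S404 can be sketched with Python's `re` module. The two patterns below are assumptions that approximate the rules mentioned above (eight digits written as xxxx-xx-xx for time of birth, eleven digits starting with 1 for the contact field); the embodiment does not give its rules in regular expression syntax, and the names `BIRTH_DATE_RULE`, `CONTACT_RULE` and `satisfies_rule` are hypothetical.

```python
import re
from typing import Pattern

# Assumed rule for time of birth: four digits, two digits, two digits.
BIRTH_DATE_RULE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumed rule for the contact field: eleven digits, the first being 1.
CONTACT_RULE = re.compile(r"1\d{10}")


def satisfies_rule(data_item: str, rule: Pattern[str]) -> bool:
    """S402: the whole data item must satisfy the rule of the target field."""
    return rule.fullmatch(data_item) is not None


for item in ["1990-01-31", "asthma", "13800000000"]:
    # S403 / S404: a match or non-match result per data item and rule.
    print(item, satisfies_rule(item, BIRTH_DATE_RULE), satisfies_rule(item, CONTACT_RULE))
```

`fullmatch` is used rather than `search` so that a data item that merely contains a date-like or number-like fragment is not counted as a match.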

In addition, in a possible implementation manner, the implementation manner for determining the field to be identified is specifically: determining a target data table where the target field is located; searching a data table matched with the target data table in the data table to be identified; and determining the fields in the data table matched with the target data table as the fields to be identified. Namely, the target data table where the target field is located is determined, then the data table matched with the target data table is searched in the database comprising various data tables, and each field in the data table is determined as the field to be identified.

For example, the target field is a birth date field, which may appear in several data tables such as a registration table and an admission registration table. If the birth date field is a field in the registration table, the registration table is determined as the target data table. The database is then searched for data tables that match the registration table; if the registration table and the patient registration table are found, these two data tables are determined as the data tables matching the target data table, and every field in the registration table and the patient registration table is determined as a field to be identified.
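The selection of fields to be identified can be sketched as follows. The text does not say how two data tables are judged to match, so the matcher is passed in by the caller; the name-containment matcher `shares_keyword`, the function `fields_to_identify` and the table and field names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def shares_keyword(target_table: str, candidate_table: str) -> bool:
    """Hypothetical matcher: the candidate table name contains the target name."""
    return target_table in candidate_table


def fields_to_identify(
    target_table: str,
    database: Dict[str, Sequence[str]],
    tables_match: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Collect (table, field) pairs from every table matching the target table."""
    fields = []
    for table_name, table_fields in database.items():
        if tables_match(target_table, table_name):
            # Every field of a matching table becomes a field to be identified.
            fields.extend((table_name, field) for field in table_fields)
    return fields


database = {
    "registration table": ["name", "birth_date"],
    "patient registration table": ["patient_name", "birthday"],
    "drug table": ["drug_name"],
}
print(fields_to_identify("registration table", database, shares_keyword))
```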

Based on the above method embodiments, the present application further provides a device for identifying matching fields, which will be described below with reference to the accompanying drawings.

Referring to fig. 5, which is a block diagram of an apparatus for identifying a matching field according to an embodiment of the present application, as shown in fig. 5, the apparatus may include:

a first determining unit 501, configured to determine an identification manner of a target field;

an obtaining unit 502, configured to identify, by using the identification manner, a data item of a field to be identified, and obtain an identification result of whether the data item of the field to be identified matches the target field;

a second determining unit 503, configured to determine whether the field to be identified matches the target field according to an identification result indicating whether the data item of the field to be identified matches the target field;

a third determining unit 504, configured to determine a field to be identified that matches the target field as a matching field of the target field.

In a possible implementation manner, the obtaining unit includes:

the first obtaining subunit is configured to obtain a target deep learning model corresponding to the target field when the recognition mode is deep learning model recognition; the target deep learning model corresponding to the target field is obtained by training according to positive sample data and negative sample data, wherein the positive sample data is represented by the characteristics of the data items matched with the target field, and the negative sample data is represented by the characteristics of the data items unmatched with the target field;

a generating subunit, configured to generate a feature representation of the data item of the field to be identified;

and the second obtaining subunit is configured to input the feature representation of the data item of the field to be identified into the target deep learning model corresponding to the target field, and obtain an identification result of whether the data item of the field to be identified matches the target field.

In one possible implementation, the generating subunit includes:

the extraction subunit is used for extracting text features of the data items of the fields to be identified, wherein the text features comprise one or more of character features, inter-character position features, word features and inter-word position features;

the calculating subunit is used for calculating the matching degree characteristics of the data items of the fields to be identified and each training text set;

and the forming subunit is used for forming the text features of the data items of the field to be recognized and the matching degree features of the data items of the field to be recognized and each training text set into the feature representation of the data items of the field to be recognized.

In a possible implementation manner, the calculating subunit is specifically configured to obtain a matching value between a data item of the field to be identified and a jth data item in an ith training text set, where i and j are positive integers, and each training text set includes data items of the same category;

In a possible implementation manner, the obtaining unit includes:

the third acquisition subunit is used for acquiring the keywords corresponding to the target field when the identification mode is character matching identification;

the matching subunit is used for matching the data item of the field to be identified with the keyword corresponding to the target field;

a fourth obtaining subunit, configured to obtain, if the data item of the field to be identified is matched with the keyword corresponding to the target field, an identification result that the data item of the field to be identified is matched with the target field;

and the fifth acquiring subunit is configured to acquire, if the data item of the field to be identified does not match the keyword corresponding to the target field, an identification result that the data item of the field to be identified does not match the target field.

In a possible implementation manner, the obtaining unit includes:

a sixth obtaining subunit, configured to obtain a regular rule corresponding to the target field when the identification manner is a regular rule matching identification;

the judging subunit is used for judging whether the data item of the field to be identified meets the regular rule corresponding to the target field;

a seventh obtaining subunit, configured to obtain, when a determination result of the determining subunit indicates that the data item of the field to be identified meets a regular rule corresponding to the target field, an identification result that the data item of the field to be identified is matched with the target field;

and the eighth acquiring subunit is configured to acquire, when the judgment result of the judging subunit is that the data item of the field to be identified does not satisfy the regular rule corresponding to the target field, an identification result that the data item of the field to be identified does not match the target field.

In a possible implementation manner, the second determining unit is specifically configured to: among the recognition results of whether a plurality of randomly selected data items in the field to be recognized match the target field, determine that the field to be recognized matches the target field if the number of recognition results that match the target field is greater than the number that do not match, and determine that the field to be recognized does not match the target field if the number of recognition results that match the target field is less than or equal to the number that do not match.
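The decision rule of the second determining unit can be sketched as a majority vote over a random sample of data items. The function name `field_matches`, the `sample_size` parameter and the per-item recognizer passed as `item_matches` are hypothetical; any of the recognition modes described above could play the role of the recognizer.

```python
import random
from typing import Callable, Optional, Sequence


def field_matches(
    data_items: Sequence[str],
    item_matches: Callable[[str], bool],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide whether a field matches the target field from sampled data items."""
    rng = rng or random.Random()
    items = list(data_items)
    # Randomly select data items of the field to be recognized.
    sample = rng.sample(items, min(sample_size, len(items)))
    matched = sum(1 for item in sample if item_matches(item))
    unmatched = len(sample) - matched
    # Match only on a strict majority; a tie counts as not matching.
    return matched > unmatched


disease_items = {"asthma", "heart disease", "senile dementia"}
print(field_matches(["asthma", "internal medicine", "heart disease"],
                    lambda item: item in disease_items, sample_size=3))
```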

In one possible implementation, the apparatus further includes:

the fourth determining unit is used for determining a target data table where the target field is located;

the searching unit is used for searching a data table matched with the target data table in a data table to be identified;

and the fifth determining unit is used for determining the field in the data table matched with the target data table as the field to be identified.

It should be noted that, implementation of each unit in this embodiment may refer to the above method embodiment, and this embodiment is not described herein again.

In addition, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the instructions cause the terminal device to execute the method for identifying a matching field.

The embodiment of the application provides a device for identifying a matching field, which comprises: the device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the method for identifying the matching field. Based on the method, when the matching field is identified, whether the data item corresponding to the field to be identified is matched with the target field is determined, and whether the field to be identified is matched with the target field is determined according to the identification result of whether the data item corresponding to the field to be identified is matched with the target field. Because the expression forms of the field to be recognized and the target field are not uniform, the field to be recognized and the target field cannot be directly matched, and the data item which can represent the field to be recognized is matched with the target field, so that the recognition of the matched field is realized. In addition, different identification modes are adopted for different target fields, and the efficiency of identifying the data items is improved.

It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. Because the system or the device disclosed by the embodiments corresponds to the method disclosed by the embodiments, its description is relatively brief, and for the relevant points reference may be made to the description of the method part.

It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, and c may each be single or plural.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of identifying matching fields, the method comprising:

determining an identification mode of a target field;

2. The method according to claim 1, wherein the identifying the data item in the field to be identified by using the identification manner and obtaining an identification result of whether the data item in the field to be identified matches the target field comprises:

generating a feature representation of a data item of a field to be identified;

3. The method of claim 2, wherein generating the feature representation of the data item for the field to be identified comprises:

4. The method of claim 3, wherein the text characteristics of the data items of the field to be identified comprise any one or more of the following:

5. The method according to claim 3, wherein the calculating the matching degree characteristic of the data items of the field to be recognized and each training text set comprises:

6. The method of claim 2, wherein the training process of the target deep learning model corresponding to the target field comprises:

7. The method according to claim 1, wherein the identifying the data item in the field to be identified by using the identification manner and obtaining an identification result of whether the data item in the field to be identified matches the target field comprises:

8. An apparatus that identifies matching fields, the apparatus comprising:

9. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of identifying matching fields according to any one of claims 1-7.

10. An apparatus for identifying matching fields, comprising: memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of identifying matching fields according to any of claims 1-7 when executing the computer program.