CN114220112A

CN114220112A - Method and system for extracting employment relationships from personal business cards

Info

Publication number: CN114220112A
Application number: CN202111544385.4A
Authority: CN
Inventors: 李佳静; 瞿签新; 林润; 汪严博; 高小涵; 张贵鹏; 张泽豪; 郝亚鑫; 曾伟豪
Original assignee: China University of Mining and Technology Beijing CUMTB
Current assignee: China University of Mining and Technology Beijing CUMTB
Priority date: 2021-12-16
Filing date: 2021-12-16
Publication date: 2022-03-22
Anticipated expiration: 2041-12-16
Also published as: CN114220112B

Abstract

The invention discloses a method for extracting employment relationships from personal business cards, comprising the following steps: step 1, obtaining a business card picture and preprocessing it; step 2, extracting the text in the preprocessed business card picture to obtain text regions; step 3, identifying three kinds of entities in the text regions, namely person names, work units and positions; step 4, correcting the person names, work units and positions identified in step 3; and step 5, forming triples that express the employment relationship from the corrected person names, work units and positions, and storing the triples in an electronic business card database. The invention also discloses a system for extracting employment relationships from personal business cards, which enables the employment relationships on business cards to be entered and stored automatically and supports expanding and managing a personal network of contacts.

Description

Method and system for extracting employment relationships from personal business cards

Technical Field

The invention relates to the technical field of search and information extraction, and in particular to a method and a system for extracting employment relationships from personal business cards.

Background

The business card is an important carrier of identity information in today's business communication and daily life. In everyday use it helps establish contact, deepen impressions and build preliminary business trust, and it is a cost-effective tool for raising personal influence and increasing the chances of cooperation. In earlier years, people usually converted the contents of business cards into electronic information by manual entry and then stored and managed it on a digital storage device. This approach has two drawbacks. First, it is inefficient and cannot cope when a large amount of data must be processed. Second, it is costly: simple entry work requires repetitive labor by staff at a computer, a great deal of time and effort is later spent managing and maintaining the database, and the database is often difficult to connect with other people's databases. As communication becomes ever more frequent, the demand for business card entry keeps growing, and automatic entry and storage of business cards by technical means has become both feasible and urgent.

Existing business card processing methods generally extract only contact information such as mobile phone numbers and e-mail addresses. The employment relationship on a business card, however, is important for organizing and managing a personal network of contacts. An employment relationship is expressed as a triple of person name, work unit and position, and the prior methods do not solve the following problems:

(1) identifying the three kinds of entities, namely person name, work unit and position, in the text extraction result;

(2) correcting misrecognized characters according to the characteristics of each entity;

(3) matching a plurality of work units and positions to generate correct <person name, work unit, position> triples.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a method and a system for extracting employment relationships from personal business cards.

The invention adopts the following technical scheme to solve the technical problems:

the invention provides a method for extracting employment relationships from personal business cards, comprising the following steps:

step 1, obtaining a business card picture and preprocessing the business card picture;

step 2, extracting the text in the preprocessed business card picture to obtain text regions;

step 3, identifying three kinds of entities in the text regions, namely person names, work units and positions;

step 4, correcting the person names, work units and positions identified in step 3;

step 5, forming a plurality of triples expressing the employment relationship from the corrected person names, work units and positions, and storing the triples in an electronic business card database, wherein each triple is <person name, work unit, position>.

As a further optimization of the method for extracting employment relationships from personal business cards, the business card picture in step 1 is obtained by photographing, by a web crawler, or from a user;

the preprocessing comprises the following steps:

if the business card picture contains a plurality of business cards, the picture is first divided into pictures of single business cards, and then binarization, noise smoothing, and tilt angle detection and correction are performed on each single business card.

As a further optimization of the method for extracting employment relationships from personal business cards, the extraction in step 2 comprises text detection and text recognition.

As a further optimization of the method for extracting employment relationships from personal business cards,

in step 1, a picture training and test set is automatically generated for the preprocessed business card pictures;

in step 2, text extraction uses the business card pictures in the automatically generated picture training and test set, wherein the method for automatically generating the training and test set comprises generating pictures of Chinese characters in various fonts with different noise, and automatically adjusting the angle of business card pictures to generate a plurality of test samples.

As a further optimization of the method for extracting employment relationships from personal business cards, in step 3 the three kinds of entities, namely person name, work unit and position, are identified using a named entity recognition method, and when the same text region contains two or more entities, a Chinese lexical tool is used to split it into single entities.

As a further optimization of the method for extracting employment relationships from personal business cards, the correction method in step 4 is as follows:

first, for the identified person name, if the corresponding pinyin of the name appears in the business card picture, the Chinese character that has the same pinyin and the closest glyph is obtained from a Chinese character pinyin library and used for correction; if no pinyin is present, the Chinese character with the closest glyph is selected using a glyph similarity measure and used for correction;

second, whether the identified work unit is a logo is judged from its location and font; if it is a logo, a logo recognition algorithm is called to recognize and correct it; if it is not a logo but English, pinyin or address information of the work unit is present, that information is used as input to a search engine interface, and the correct work unit name returned by the search is used for correction; if no such information is present, a language model is first used to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction;

third, for the identified position, the position name with the minimum edit distance is selected from a position dictionary library and used for correction; if the edit distances to all position names in the dictionary are greater than a preset threshold, the corrected work unit name and the position to be corrected are input together into the language model to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction.

As a further optimization of the method for extracting employment relationships from personal business cards, in step 5, when there are a plurality of work units and positions, the work units and positions are paired according to the proximity of their locations on the card; if a position has no work unit adjacent to it, the identified logo is used as the work unit corresponding to that position.

A system for extracting employment relationships from personal business cards comprises:

the picture training and test set unit, used for storing pictures containing Chinese characters in various fonts with different noise, and for automatically adjusting the angle of business card pictures to generate business card pictures serving as a plurality of test samples;

the text knowledge base unit, used for storing a Chinese character pinyin library, a stroke order library, and dictionaries of positions and unit names;

the text extraction unit, used for extracting the text in the business card picture, obtaining a text extraction result that includes the text regions, and outputting the result to the entity recognition unit;

the entity recognition unit, used for identifying the three kinds of entities, namely person name, work unit and position, in the text extraction result; when the same text region contains two or more entities, a Chinese lexical tool is used to split it into single entities;

the entity correction unit, used for correcting those parts of the identified person names, work units and positions whose confidence is lower than a preset value;

the employment relationship generating unit, used for generating a plurality of <person name, work unit, position> triples and storing the triples in a database;

the entity correction unit comprises a name correction subunit, a work unit correction subunit and a position correction subunit:

the name correction subunit, used for correcting the identified person name with the Chinese character that has the same pinyin and the closest glyph, obtained from the Chinese character pinyin library, if the corresponding pinyin of the name appears in the business card picture; if no pinyin is present, the Chinese character with the closest glyph is selected using a glyph similarity measure and used for correction;

the work unit correction subunit, used for judging from the location and font of the identified work unit whether it is a logo, and if it is a logo, calling a logo recognition algorithm to recognize and correct it; if it is not a logo but English, pinyin or address information of the work unit is present, that information is used as input to a search engine interface, and the correct work unit name returned by the search is used for correction; if no such information is present, a language model is first used to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction;

the position correction subunit, used for selecting, for the identified position, the position name with the minimum edit distance from the position dictionary library and using it for correction; if the edit distances to all position names in the dictionary are greater than a preset threshold, the corrected work unit name and the position to be corrected are input together into the language model to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

the method and the system can automatically realize the following functions:

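(1) identifying the three kinds of entities, namely person name, work unit and position, in the text extraction result;

(2) correcting misrecognized characters according to the characteristics of each entity;

(3) matching a plurality of work units and positions to generate a plurality of correct <person name, work unit, position> triples;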

the accuracy of employment relationship extraction is therefore improved, the employment relationships on personal business cards are entered and stored automatically, and the viewing rate and dissemination of electronic business cards are improved. Based on the extracted employment relationships, a personal network of contacts can be managed and expanded.

Drawings

FIG. 1 is a flow chart of the method for extracting employment relationships from personal business cards;

FIG. 2 is a block diagram of the system for extracting employment relationships from personal business cards;

FIG. 3 is a block diagram of the text extraction unit;

FIG. 4 is a block diagram of the entity correction unit.

Detailed Description

The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings:

as shown in FIG. 1, a method for extracting employment relationships from personal business cards includes the following steps:

step 1, obtaining a business card picture and preprocessing the picture;

step 2, extracting the text in the picture;

step 3, identifying the three kinds of entities, namely person name, work unit and position, in the extracted text regions;

step 4, correcting the person name, work unit and position identified in step 3 when the text recognition confidence is lower than a threshold;

step 5, forming a plurality of <person name, work unit, position> triples and storing the triples in a database.

The business card picture in step 1 may be obtained by photographing, collected from the network using a crawler, or provided by a user. The picture preprocessing comprises the following steps:

(1) if the picture contains a plurality of business cards, the picture is first divided into pictures of single business cards;

(2) binarization, noise smoothing, tilt angle detection and correction, and other processing are performed on the picture.
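The binarization part of preprocessing can be sketched as follows. The patent does not name a thresholding method, so Otsu's method is used here purely as an assumption, and the names `otsu_threshold` and `binarize` are hypothetical. The picture is modelled as a nested list of 0..255 gray values so that the sketch needs no image library; noise smoothing and tilt correction are not shown.

```python
def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu).

    `gray` is a 2-D list of integer gray values in 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Map dark pixels (ink) to 0 and bright pixels (paper) to 1."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if value > threshold else 0 for value in row] for row in gray]


if __name__ == "__main__":
    # Dark strokes on the left, bright card background on the right.
    picture = [[10, 12, 200], [11, 210, 205], [9, 198, 202]]
    for row in binarize(picture):
        print(row)
```

A global threshold is the simplest choice; unevenly lit photographs of cards may call for a local (adaptive) threshold instead, which the patent leaves open.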

Step 2, extracting the text in the picture, comprises two sub-steps: text detection and text recognition. Text detection may use a detection model based on image segmentation, such as DB, which judges whether each pixel belongs to a text target and how it connects to its surrounding pixels, and then merges the results of adjacent pixels into text boxes. Text recognition may use the CRNN recognition model, which comprises convolutional layers (CNN), recurrent layers (RNN) and a transcription layer (CTC).
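To illustrate what the CTC transcription layer does at inference time, the following is a minimal sketch of greedy (best-path) CTC decoding. It assumes the recurrent layers have already produced one score vector per image column; the function name `ctc_greedy_decode`, the convention that index 0 is the blank label, and the toy alphabet are assumptions for illustration, not details from the patent.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding.

    frame_scores: list of per-frame score lists, one score per label.
    alphabet:     list mapping label index to character (index `blank` unused).
    Picks the best label per frame, collapses consecutive repeats,
    then drops blanks.
    """
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    chars = []
    previous = blank
    for label in best_path:
        if label != blank and label != previous:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)


if __name__ == "__main__":
    alphabet = ["", "A", "B"]          # index 0 is the CTC blank
    frames = [
        [0.1, 0.8, 0.1],               # A
        [0.2, 0.7, 0.1],               # A (repeat, collapsed)
        [0.9, 0.05, 0.05],             # blank separates the two A's
        [0.1, 0.8, 0.1],               # A
        [0.1, 0.2, 0.7],               # B
        [0.1, 0.2, 0.7],               # B (repeat, collapsed)
    ]
    print(ctc_greedy_decode(frames, alphabet))
```

The blank label is what allows a genuinely doubled character to survive the collapse of repeated frames. The maximum score on each kept frame could also serve as the recognition confidence that step 4 compares against its threshold, although the patent does not say how that confidence is computed.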

Before step 2, the method further includes a step of automatically generating a picture training and test set for training the text extraction method. The method for automatically generating the training and test set comprises generating pictures, with different noise, of 3000 commonly used Chinese characters in common fonts including Songti, Kaiti (regular script), Lishu (clerical script) and Heiti, and automatically adjusting the angle of business card pictures to generate a plurality of test samples.

In step 3, the three kinds of entities, namely person name, work unit and position, are identified using a named entity recognition method. When the same region contains two or more entities, a Chinese lexical tool is used to split it into single entities; for example, "CEO of Fujian Kogaku consult Co., Ltd" contains two entities, a work unit and a position. The lexical tool may be the Baidu LAC Chinese lexical analysis tool. Named entity recognition may use a three-layer model comprising an embedding layer, a BiLSTM layer and a CRF decoding layer.
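The CRF decoding layer of such a model can be sketched with Viterbi decoding over per-character tag scores, followed by grouping BIO tags into entities. This is a minimal sketch under stated assumptions: the tag set (`ORG` for work unit, `TITLE` for position), the hand-written emission scores standing in for BiLSTM output, and all function names are hypothetical, and the splitting performed by the lexical tool is not shown.

```python
def build_transitions(tags, penalty=-100.0):
    """Transition scores that forbid I-X unless preceded by B-X or I-X."""
    trans = [[0.0] * len(tags) for _ in tags]
    for i, prev in enumerate(tags):
        for j, cur in enumerate(tags):
            if cur.startswith("I-") and prev[2:] != cur[2:]:
                trans[i][j] = penalty
    return trans


def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for the given scores."""
    if not emissions:
        return []
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    path = [max(range(n), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path]


def bio_to_entities(chars, labels):
    """Group characters tagged B-X / I-X into (type, text) entities."""
    entities, cur_type, cur_text = [], None, ""
    for ch, label in zip(chars, labels):
        if label.startswith("I-") and cur_type == label[2:]:
            cur_text += ch
            continue
        if cur_type is not None:
            entities.append((cur_type, cur_text))
        cur_type, cur_text = (label[2:], ch) if label.startswith("B-") else (None, "")
    if cur_type is not None:
        entities.append((cur_type, cur_text))
    return entities


if __name__ == "__main__":
    tags = ["O", "B-ORG", "I-ORG", "B-TITLE", "I-TITLE"]
    text = "某公司经理"  # one text region holding a work unit and a position
    wanted = ["B-ORG", "I-ORG", "I-ORG", "B-TITLE", "I-TITLE"]
    # Stand-in for BiLSTM output: a high score for the intended tag.
    emissions = [[5.0 if t == w else 0.0 for t in tags] for w in wanted]
    labels = viterbi_decode(emissions, build_transitions(tags), tags)
    print(bio_to_entities(text, labels))
```

In a trained model the transition scores are learned rather than hand-set; the fixed penalty here only shows how the CRF layer rules out inconsistent tag sequences.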

In step 4, the correction method is as follows:

(1) For the identified person name, if the business card contains pinyin corresponding to the name, the Chinese character that has the same pinyin and the closest glyph is obtained from a Chinese character pinyin library and used for correction; if no pinyin is present, the Chinese character with the closest glyph is selected using the glyph similarity measure and used for correction. The glyph similarity may be calculated from the edit distance between IDS (Ideographic Description Sequence) representations.
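The name correction rule can be sketched as follows, with glyph similarity taken as the edit distance between IDS strings. The tiny `IDS` and `PINYIN` tables are toy assumptions for illustration, not a real IDS or pinyin database, and the function names and the `vocabulary` fallback for the no-pinyin case are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


# Toy lookup tables (assumptions for illustration only).
IDS = {"李": "⿱木子", "季": "⿱禾子", "张": "⿰弓长"}
PINYIN = {"li": ["李", "力"], "ji": ["季"]}


def glyph_distance(a, b, ids=IDS):
    """Edit distance between the IDS strings of two characters.

    Characters without an IDS entry are treated as atomic.
    """
    return edit_distance(ids.get(a, a), ids.get(b, b))


def closest_glyph(ocr_char, candidates):
    """Pick the candidate whose glyph is closest to the OCR output."""
    return min(candidates, key=lambda c: glyph_distance(ocr_char, c))


def correct_name_char(ocr_char, syllable=None, vocabulary=()):
    """Correct one character of a recognized name.

    With pinyin on the card: choose among characters sharing that pinyin.
    Without pinyin: choose among a candidate vocabulary by glyph alone.
    """
    if syllable and PINYIN.get(syllable):
        return closest_glyph(ocr_char, PINYIN[syllable])
    if vocabulary:
        return closest_glyph(ocr_char, vocabulary)
    return ocr_char


if __name__ == "__main__":
    # OCR read "季" but the card also shows the pinyin "li".
    print(correct_name_char("季", "li"))
    # No pinyin on the card: fall back to glyph similarity only.
    print(correct_name_char("季", vocabulary=["王", "李", "张"]))
```

Restricting candidates by pinyin first keeps the glyph comparison from drifting to a visually similar character with the wrong reading.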

(2) Whether the identified work unit is a logo is judged from information such as its location and font. If it is a logo, a logo recognition algorithm is called to recognize and correct it; if it is not a logo but English, pinyin or address information of the work unit is present, that information is used as input to a search engine interface, and the correct work unit name returned by the search is used for correction; if no such information is present, the language model is used to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction. The language model may be a BERT model, and the glyph similarity may be calculated based on IDS.
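The three-way decision for work unit correction can be sketched as a small dispatcher. Everything concrete here is an assumption: the dictionary keys of `region`, and the three callables standing in for the logo recognition algorithm, the search engine interface and the language model plus glyph similarity step, none of which the patent specifies as an API.

```python
def correct_work_unit(region, recognize_logo, search_unit_name, lm_glyph_correct):
    """Route a recognized work unit to the appropriate correction path.

    region: dict with hypothetical keys
        "text"     recognized work unit text
        "is_logo"  True if location and font suggest a logo
        "english", "pinyin", "address"  optional auxiliary text on the card
    The three callables are placeholders for external components.
    """
    if region.get("is_logo"):
        return recognize_logo(region)

    hints = [region[key] for key in ("english", "pinyin", "address")
             if region.get(key)]
    if hints:
        return search_unit_name(" ".join(hints))

    return lm_glyph_correct(region["text"])


if __name__ == "__main__":
    # Stub components so the control flow can be exercised offline.
    stubs = dict(
        recognize_logo=lambda r: "name from logo recognizer",
        search_unit_name=lambda q: "name from search for: " + q,
        lm_glyph_correct=lambda t: "name from language model for: " + t,
    )
    print(correct_work_unit({"text": "x", "is_logo": True}, **stubs))
    print(correct_work_unit({"text": "x", "english": "Example Co"}, **stubs))
    print(correct_work_unit({"text": "x"}, **stubs))
```

Passing the components in as callables keeps the ordering of the three paths explicit and lets each one be replaced or tested independently.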

(3) For the identified position, the position name with the minimum edit distance is selected from the position dictionary library and used for correction; if the edit distances to all position names in the dictionary are greater than the threshold, the corrected work unit name and the position to be corrected are input together into the language model to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction. The language model may be a BERT model, and the glyph similarity may be calculated based on IDS.
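The dictionary lookup with a distance threshold can be sketched as follows. The dictionary entries, the threshold value and the `fallback` callable (standing in for the language model path, which is not implemented here) are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def correct_position(ocr_text, dictionary, threshold, fallback=None):
    """Replace a recognized position with its nearest dictionary entry.

    If even the nearest entry is farther than `threshold`, defer to
    `fallback` (placeholder for the language model step), or keep the
    recognized text when no fallback is supplied.
    """
    if dictionary:
        nearest = min(dictionary, key=lambda name: edit_distance(ocr_text, name))
        if edit_distance(ocr_text, nearest) <= threshold:
            return nearest
    return fallback(ocr_text) if fallback else ocr_text


if __name__ == "__main__":
    positions = ["总经理", "副总经理", "工程师"]
    # One misrecognized character: within the threshold, so the
    # dictionary entry is used.
    print(correct_position("总经埋", positions, threshold=1))
    # Far from every entry: handed to the fallback instead.
    print(correct_position("首席科学家", positions, threshold=1,
                           fallback=lambda t: "needs language model: " + t))
```

The threshold keeps positions that are absent from the dictionary from being forced onto an unrelated entry.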

In step 5, there may be a plurality of work units and positions, and the work units and positions are paired according to the proximity of their locations on the card. If a position has no work unit adjacent to it, the identified logo is used as the work unit corresponding to that position.
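The pairing rule can be sketched as a nearest-neighbor match on text region coordinates. The patent does not define when two regions count as adjacent, so the Euclidean distance between region centers and the `max_dist` cutoff are assumptions, as are the `Region` type and the function name.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Region:
    text: str
    x: float  # center of the text region, in pixels
    y: float


def build_triples(name, units, positions, logo_unit=None, max_dist=100.0):
    """Pair each position with its nearest work unit.

    A position whose nearest work unit is farther than `max_dist`
    (or that has no work unit at all) falls back to the work unit
    read from the logo, if one was identified.
    """
    triples = []
    for pos in positions:
        def distance(unit):
            return hypot(unit.x - pos.x, unit.y - pos.y)

        nearest = min(units, key=distance, default=None)
        if nearest is not None and distance(nearest) <= max_dist:
            triples.append((name, nearest.text, pos.text))
        elif logo_unit is not None:
            triples.append((name, logo_unit, pos.text))
    return triples


if __name__ == "__main__":
    units = [Region("A公司", 10, 10), Region("B协会", 10, 200)]
    positions = [Region("总经理", 12, 30),
                 Region("理事", 12, 220),
                 Region("顾问", 400, 400)]
    for triple in build_triples("张三", units, positions,
                                logo_unit="C集团", max_dist=50):
        print(triple)
```

Business cards usually print a position directly above or below its organization, which is why a simple distance test is a plausible first approximation; a layout-aware rule (same column, same text block) would be a natural refinement.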

As shown in FIG. 2, a system for extracting employment relationships from personal business cards includes the following components:

picture training and test set unit: contains pictures, with different noise, of 3000 commonly used Chinese characters in common fonts including Songti, Kaiti (regular script), Lishu (clerical script) and Heiti, together with picture data such as a plurality of test samples generated by automatically adjusting the angle of business card pictures;

text knowledge base unit: contains a Chinese character pinyin library, a stroke order library, dictionaries of positions and unit names, and the like;

text extraction unit: extracts the text in the business card picture;

entity recognition unit: identifies the three kinds of entities, namely person name, work unit and position, in the text extraction result; when the same region contains two or more entities, a Chinese lexical tool is used to split it into single entities, and the lexical tool may be the Baidu LAC Chinese lexical analysis tool. Named entity recognition may use a three-layer model consisting of an embedding layer, a BiLSTM layer and a CRF decoding layer;

entity correction unit: corrects the low-confidence parts of the identified person names, work units and positions;

employment relationship generating unit: generates a plurality of <person name, work unit, position> triples and stores them in a database.
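Storing the generated triples can be sketched with Python's built-in `sqlite3` module. The patent does not specify a database engine or schema, so the table name `employment`, its columns and the uniqueness constraint are assumptions for illustration.

```python
import sqlite3


def store_triples(conn, triples):
    """Insert <person name, work unit, position> triples, skipping duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS employment (
               person    TEXT NOT NULL,
               work_unit TEXT NOT NULL,
               position  TEXT NOT NULL,
               UNIQUE (person, work_unit, position)
           )"""
    )
    conn.executemany(
        "INSERT OR IGNORE INTO employment (person, work_unit, position) "
        "VALUES (?, ?, ?)",
        triples,
    )
    conn.commit()


def positions_of(conn, person):
    """Return all (work unit, position) pairs stored for one person."""
    rows = conn.execute(
        "SELECT work_unit, position FROM employment "
        "WHERE person = ? ORDER BY work_unit, position",
        (person,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_triples(conn, [
        ("张三", "A公司", "总经理"),
        ("张三", "B协会", "理事"),
        ("张三", "A公司", "总经理"),  # same card scanned twice
    ])
    print(positions_of(conn, "张三"))
```

The uniqueness constraint means rescanning the same card does not create duplicate relationships, and querying by person or by work unit is what makes the stored triples usable for managing a network of contacts.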

As shown in FIG. 3, the text extraction unit includes a detection subunit and a recognition subunit, which implement text detection and text recognition respectively. Text detection may use a detection model based on image segmentation, such as DB, which judges whether each pixel belongs to a text target and how it connects to its surrounding pixels, and then merges the results of adjacent pixels into text boxes. Text recognition may use the CRNN recognition model, which comprises convolutional layers (CNN), recurrent layers (RNN) and a transcription layer (CTC).

As shown in FIG. 4, the entity correction unit includes a name correction subunit, a work unit correction subunit and a position correction subunit:

(1) The name correction subunit corrects the identified person name with the Chinese character that has the same pinyin and the closest glyph, obtained from the Chinese character pinyin library, if the business card contains pinyin corresponding to the name; if no pinyin is present, the Chinese character with the closest glyph is selected using the glyph similarity measure and used for correction. The glyph similarity may be calculated from the edit distance between IDS (Ideographic Description Sequence) representations.

(2) The work unit correction subunit judges from information such as the location and font of the identified work unit whether it is a logo, and if it is a logo, calls a logo recognition algorithm to recognize and correct it; if it is not a logo but English, pinyin or address information of the work unit is present, that information is used as input to a search engine interface, and the correct work unit name returned by the search is used for correction; if no such information is present, the language model is used to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction. The language model may be a BERT model, and the glyph similarity may be calculated based on IDS.

(3) The position correction subunit selects, for the identified position, the position name with the minimum edit distance from the position dictionary library and uses it for correction. If the edit distances to all position names in the dictionary are greater than the threshold, the corrected work unit name and the position to be corrected are input together into the language model to obtain the most probable characters, and the character with the closest glyph among them is then selected using the glyph similarity measure and used for correction. The language model may be a BERT model, and the glyph similarity may be calculated based on IDS.

The above describes only specific embodiments of the present invention, and the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention falls within the scope of the present invention.

Claims

1. A method for extracting employment relationships from personal business cards, characterized by comprising the following steps:

2. The method for extracting employment relationships from personal business cards according to claim 1, wherein the business card picture in step 1 is obtained by photographing, by a web crawler, or from a user;

the preprocessing comprises the following steps:

3. The method for extracting employment relationships from personal business cards according to claim 1, wherein the extraction in step 2 comprises text detection and text recognition.

4. The method for extracting employment relationships from personal business cards according to claim 1, wherein,

5. The method for extracting employment relationships from personal business cards according to claim 1, wherein in step 3 the three kinds of entities, namely person name, work unit and position, are identified using a named entity recognition method, and when the same text region contains two or more entities, a Chinese lexical tool is used to split it into single entities.

6. The method for extracting employment relationships from personal business cards according to claim 1, wherein in step 4 the correction method comprises:

7. The method for extracting employment relationships from personal business cards according to claim 6, wherein in step 5, when there are a plurality of work units and positions, the work units and positions are paired according to the proximity of their locations; and if a position has no work unit adjacent to it, the identified logo is used as the work unit corresponding to that position.

8. A system for extracting employment relationships from personal business cards, comprising