CN111310448B - Data supplementing method, system, device and storage medium

Info

Publication number: CN111310448B
Application number: CN202010085091.9A
Authority: CN
Inventors: 李孟柱
Original assignee: Jiangsu Manyun Software Technology Co Ltd
Current assignee: Jiangsu Manyun Software Technology Co Ltd
Priority date: 2020-02-10
Filing date: 2020-02-10
Publication date: 2023-10-31
Anticipated expiration: 2040-02-10
Also published as: CN111310448A

Abstract

The invention relates to the technical field of data processing and provides a data supplementing method, a system, equipment and a storage medium. The data supplementing method comprises the following steps: acquiring a plurality of pieces of data to be supplemented, which are associated with the identification field, according to the identification field; identifying the languages of each piece of data to be supplemented through the language identification model, and shunting the data to be supplemented of the identified languages to a data pool of the corresponding languages; traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be supplemented in the data pool of each language to obtain a matching result; and supplementing the matched data to be supplemented into the existing data entry, and adding the unmatched data to be supplemented into the created data entry. The invention carries out language identification on the data to be supplemented, and matches and compares the data to be supplemented with the existing data of the corresponding language, thereby realizing automatic supplementation of the data to be supplemented into the data items of the corresponding language.

Description

Data supplementing method, system, device and storage medium

Technical Field

The present invention relates to the field of data processing technology, and in particular, to a data supplementing method, system, device, and storage medium.

Background

Internet companies need to frequently supplement, update and modify the data on their websites so that the displayed data stays up to date, accurate and comprehensive. With the rapid development of the internet industry, many internet companies are gradually expanding their overseas services while developing their domestic services. It is therefore necessary to supplement existing systems with valid information in different languages.

In the prior art, Chinese data can be analyzed and processed by existing algorithms to complete the preprocessing before supplementation, but the post-processing still depends on manual review and judgment. Data in other languages, especially less widely used languages, depends even more heavily on manual verification, and its classification information has to be identified manually, so the workload is large, the efficiency is low, and the accuracy is relatively low.

As the volume of information and resources keeps growing, how to quickly and accurately supplement an existing system with valid information in different languages has become a problem to be solved.

It should be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the present invention, and it may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

In view of the above, the present invention provides a data supplementing method, system, device and storage medium, which can identify languages of data to be supplemented, and match and compare the data to be supplemented with existing data of corresponding languages, so as to automatically supplement the data to be supplemented into data entries of corresponding languages.

One aspect of the present invention provides a data augmentation method comprising the steps of: acquiring a plurality of pieces of data to be supplemented which are associated with the identification field according to the identification field; identifying the languages of each piece of data to be supplemented through the language identification model, and shunting the data to be supplemented of the identified languages to a data pool of the corresponding languages; traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be supplemented in the data pool of each language to obtain a matching result; when the identification field of data to be supplemented is matched with the identification field of an existing data item, supplementing the data to be supplemented into the existing data item; and when the identification field of the data to be supplemented is not matched with the identification field of each existing data item, creating a data item according to the identification field of the data to be supplemented, and adding the data to be supplemented into the created data item.

In some embodiments, in the data pool of each language, a mapping relationship of mutual indexes is established between data entries matched with identification fields, and after creating a data entry according to the identification field of the data to be supplemented, the method further includes the steps of: judging whether the identification field of the created data item is matched with the identification fields of the data items in the rest data pools, if so, establishing a mapping relation of mutual indexes between the created data item and the data item matched with the identification field.

In some embodiments, the language identification model is constructed based on a language model N-Gram, and the step of identifying the language of each piece of data to be supplemented by the language identification model includes: word segmentation is carried out on data to be supplemented, and a vocabulary set of the data to be supplemented is obtained; inputting each vocabulary in the vocabulary set into the language identification model to obtain the probability that each vocabulary belongs to each language; according to the probability that each vocabulary belongs to each language, the probability that the data to be supplemented belongs to each language is obtained; and determining the languages of the data to be supplemented according to the relation between the probability that the data to be supplemented belongs to each language and a preset probability threshold.

In some embodiments, in the step of traversing the identification field of the existing data entry in the data pool according to the identification field of the data to be supplemented, when the similarity between the identification field of the data to be supplemented and the identification field of the existing data entry is greater than a first similarity threshold, a matching result of the identification field of the data to be supplemented and the identification field of the existing data entry is obtained; when the similarity between the identification field of the data to be supplemented and the identification field of each existing data item is smaller than a second similarity threshold value, a matching result that the identification field of the data to be supplemented is not matched with the identification field of each existing data item is obtained; the first similarity threshold is greater than the second similarity threshold.

In some embodiments, the data augmentation method further comprises the steps of: when the maximum similarity is between the second similarity threshold and the first similarity threshold in the similarity between the identification field of the data to be supplemented and the identification field of each existing data item, pushing the data to be supplemented to a queue to be checked; and pushing the data to be supplemented, which are not identified in the language, to the queue to be checked in the step of identifying the language of each piece of data to be supplemented through the language identification model.

In some embodiments, the data to be supplemented comprises logistics data, and the identification field comprises any one or more of: merchant, driver, vehicle, address; and the data to be supplemented associated with the identification field includes data to be supplemented that exactly matches the identification field and data to be supplemented that fuzzily matches the identification field.

In some embodiments, in the step of supplementing the data to be supplemented into the existing data entry, normalizing the data to be supplemented, and merging the data to be supplemented with the existing data entry to perform deduplication; and normalizing the data to be supplemented in the step of adding the data to be supplemented to the created data entry.

Another aspect of the invention provides a data augmentation system comprising: the data acquisition module is used for acquiring a plurality of pieces of data to be supplemented, which are associated with the identification field, according to the identification field; the language identification module is used for identifying the languages of each piece of data to be supplemented through the language identification model and shunting the data to be supplemented of the identified languages to a data pool of the corresponding languages; the matching comparison module is used for traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be supplemented in the data pool of each language to obtain a matching result; the data supplementing module is used for supplementing the data to be supplemented into the existing data item when the identification field of the data to be supplemented is matched with the identification field of the existing data item; and the data newly-adding module is used for creating a data item according to the identification field of the data to be supplemented and newly adding the data to be supplemented into the created data item when the identification field of the data to be supplemented is not matched with the identification field of each existing data item.

A further aspect of the invention provides a data augmentation device comprising: a processor; a memory having stored therein executable instructions of the processor; wherein the processor is configured to perform the steps of the data augmentation method described above via execution of the executable instructions.

A further aspect of the invention provides a computer readable storage medium storing a program which when executed implements the steps of the data augmentation method described above.

Compared with the prior art, the invention has the beneficial effects that at least:

the language identification model is used for automatically identifying the languages of the data to be supplemented, the identification speed is high, the identification accuracy is high, and the data to be supplemented of the identified languages is shunted to the data pool of the corresponding languages so as to be matched and compared in a targeted manner; the method comprises the steps of obtaining a matching result of the data to be supplemented and the existing data by matching and comparing an identification field of the data to be supplemented with an identification field of the existing data entry, supplementing the matched data to be supplemented into the existing data entry, and realizing supplementation of the existing data; creating data items for the unmatched data to be supplemented, and realizing the new addition of the data; the data supplementing method can realize automatic supplementation of most data, reduce manual participation and improve the processing efficiency and accuracy of data supplementation.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 shows a flow chart of the steps of a data augmentation method in an embodiment of the present invention;

FIG. 2 is a flow chart of a model of language identification in an embodiment of the invention;

FIG. 3 is a flow chart of the architecture of the data augmentation method in an embodiment of the present invention;

FIG. 4 shows a block diagram of a data augmentation system in an embodiment of the present invention;

FIG. 5 shows a schematic structural diagram of a data augmentation device in an embodiment of the present invention; and

FIG. 6 shows a schematic structural diagram of a computer-readable storage medium in an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus a repetitive description thereof will be omitted.

Fig. 1 shows the main steps of the data augmentation method in an embodiment, with reference to fig. 1, in some embodiments the data augmentation method mainly comprises: in step S10, according to the identification field, acquiring a plurality of pieces of data to be supplemented associated with the identification field; in step S20, recognizing the language of each data to be supplemented through the language recognition model, and splitting the data to be supplemented of the recognized language into a data pool of the corresponding language; in step S30, in the data pool of each language, traversing the identification field of the existing data entry in the data pool according to the identification field of the data to be supplemented, and obtaining a matching result; in step S40, when the identification field of the data to be supplemented matches with the identification field of an existing data entry, supplementing the data to be supplemented into the existing data entry; and in step S50, when the identification field of the data to be supplemented is not matched with the identification field of each existing data item, creating a data item according to the identification field of the data to be supplemented, and adding the data to be supplemented to the created data item.

Step S10 may obtain the data to be augmented associated with the identification field by a crawler technique. The data to be augmented may be logistic data and the identification field may include any one or more of the following: merchant, driver, vehicle, address. Wherein, the merchant may include various aspects of merchant information such as the name of the merchant, the class of the merchant, the business hours of the merchant, the telephone of the merchant, etc.; the driver may include the name of the driver, the identification number of the driver, the phone of the driver, the age of the driver, etc.; the vehicle may include a license plate number of the vehicle, a model number of the vehicle, a loading tonnage of the vehicle, and the like; the address may be a business address of the merchant, a registration address of the driver, a registration address of the vehicle, and the like. The identification field is used for distinguishing different data to be supplemented, and the data to be supplemented associated with the identification field comprises the data to be supplemented which is matched with the identification field accurately and the data to be supplemented which is matched with the identification field in a fuzzy manner. Specifically, when the data to be supplemented is obtained according to the identification field, the identification field can be set according to actual needs to obtain the data to be supplemented with different matching degrees with the identification field. 
When data to be supplemented that exactly matches the identification field is required, the identification field can be made specific, and data is crawled using identification fields such as the merchant's name, the driver's identification number, or the vehicle's license plate number. When a broad range of data to be supplemented that fuzzily matches the identification field is required, the identification field can be generalized, and data is crawled using identification fields such as the merchant's class, the driver's driving age, or the vehicle's loading tonnage. In different implementation scenarios, the corresponding data to be supplemented can be obtained through different identification fields according to the data supplementing needs. In some embodiments, part of the data to be supplemented may also be obtained by operators.

Step S20 automatically identifies the language of the data to be supplemented through a language identification model, with high identification speed and accuracy. In this embodiment, the language identification model is constructed on the N-Gram language model and trained by maximum likelihood estimation (MLE). An N-Gram model is an n-element language model that can estimate the probability of input data. The language identification model is built with the N-Gram-based open source project language-detection; given one piece of data as input, it outputs the probability that the piece of data belongs to each language. The model must be trained before it is put into use. The training text consists of standard texts in different languages, which can be derived, for example, from an open source corpus on GitHub (a hosting platform for open source and private software projects) whose main sources are articles from Wikipedia and Twitter; text data that has been normalized in advance may also be used. Training is mainly divided into two steps. The first step segments the training texts of the different languages into words and counts their frequencies. The second step compares the input text with the training result set and calculates the probability that the input text belongs to a given language. The probability that a text occurs in a given language is predicted from the large body of preceding text by computing the maximum likelihood probability.
Assume a text sequence $s = w_1, w_2, \ldots, w_T$. By the chain rule, its probability can be expressed as $P(s) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \ldots, w_{t-1})$, so predicting the probability of a text reduces to predicting each conditional probability $p(w_t \mid w_1, w_2, \ldots, w_{t-1})$. Simplifying this by assuming that the probability of the current word depends only on a limited number of preceding words gives the N-gram model: $p(w_t \mid w_1, w_2, \ldots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$. These formulas are standard in the existing art and are therefore not derived here.
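As a minimal sketch of the maximum likelihood estimate behind the N-gram approximation, the conditional probability for n = 2 can be estimated as count(w_{t-1}, w_t) / count(w_{t-1}). The function name `train_bigram_mle` and the use of single characters in place of segmented words are assumptions made for illustration; this is not the implementation used by the language-detection project.

```python
from collections import Counter


def train_bigram_mle(text):
    """Estimate p(b | a) = count(ab) / count(a) over adjacent symbols.

    Symbols are single characters here (an assumption for brevity);
    segmented words could be passed as a list instead of a string.
    """
    pair_counts = Counter(zip(text, text[1:]))
    # Every symbol except the last one serves as a context exactly once.
    context_counts = Counter(text[:-1])
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical toy training text: "a" is followed by "a" once and "b" once.
model = train_bigram_mle("aab")
```

In this toy input, both `("a", "a")` and `("a", "b")` receive probability 0.5, because the context "a" occurs twice and each continuation occurs once.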

Fig. 2 shows the training process S20a and the recognition process S20b of the language identification model in an embodiment. Referring to Fig. 2, in the training process S20a, the training text is segmented into N-Gram phrases, such as two-word (bigram) and three-word (trigram) phrases, which are input into the language identification model for parameter training. In the recognition process S20b, the data to be supplemented is input into the trained language identification model, which outputs the probability distribution of the data over the different languages, and the data is determined to belong to a given language by means of a preset threshold. In one embodiment, the step of identifying the language of the data to be supplemented through the language identification model comprises the following. First, the data to be supplemented is segmented to obtain its vocabulary set; the segmentation can produce N-Gram phrases of the data as required. Then, each word in the vocabulary set is input into the language identification model to obtain the probability that the word belongs to each language; that is, each N-Gram phrase of the data to be supplemented is input into the model, which calculates the probability that the phrase belongs to each language. Next, the probability that the data to be supplemented belongs to each language is obtained from the per-word probabilities; it is the joint probability, for example the product, of the probabilities that the individual N-Gram phrases belong to that language.
Finally, the language of the data to be supplemented is determined according to the relation between the probability that the data belongs to each language and a preset probability threshold. For example, if the preset probability threshold is set to 90% and the language identification model calculates that the probability of the data to be supplemented being Chinese is 92%, which exceeds the threshold, the language of the data to be supplemented is taken to be Chinese. In some embodiments, the probabilities that the data to be supplemented belongs to the different languages may be sorted, and the language with the highest probability is taken as the language of the data to be supplemented.
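The recognition steps above (segment into N-Gram phrases, score each phrase per language, combine into a joint probability, compare with a threshold) might be sketched as follows. The per-language probability tables, the `floor` value used for unseen phrases, the normalization of the joint probabilities into a distribution, and the names `ngrams` and `identify_language` are all assumptions for illustration, not details taken from the patent.

```python
import math


def ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumed segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def identify_language(text, models, threshold=0.9, floor=1e-6):
    """Return (language or None, probability distribution over languages).

    models maps a language code to a table {ngram: probability}.
    Unseen n-grams get the small probability `floor` (an assumption).
    """
    # Joint probability as a product, computed in log space for stability.
    log_scores = {
        lang: sum(math.log(table.get(g, floor)) for g in ngrams(text))
        for lang, table in models.items()
    }
    # Normalize so the per-language scores sum to 1.
    peak = max(log_scores.values())
    weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    total = sum(weights.values())
    probs = {lang: w / total for lang, w in weights.items()}
    best = max(probs, key=probs.get)
    # Below the threshold the language is treated as unidentified.
    return (best if probs[best] > threshold else None), probs
```

Returning `None` for low-confidence inputs leaves room for pushing such data to manual checking instead of forcing a guess.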

In one embodiment, the language identification model is tested with 10,000 samples covering Simplified Chinese, Traditional Chinese, German, English, Russian, Japanese, and Korean. The test results are shown in Table 1 below:

Table 1:

Language	Accuracy
Simplified Chinese	92.3%
Traditional Chinese	90.1%
German	87%
English	89.1%
Russian	88.14%
Japanese	82.2%
Korean	84%

According to the analysis of the language identification results, the accuracy for Simplified Chinese, Traditional Chinese, German, English, and Russian is around 90%; the accuracy for Japanese and Korean is slightly lower because of the influence of Traditional Chinese, but is still high.

After the language of the data to be supplemented is identified in step S20, the data is routed to the data pool of the corresponding language, so that after the subsequent matching comparison it can be supplemented into the data entries of that language. Specifically, the system database comprises multiple data pools, each storing the data of one language, and the system can display pages in different languages according to user needs based on the data stored in the different data pools. For example, a page in the corresponding language may be presented according to the user's location or according to the user's selection. The data to be supplemented is acquired in every language, and routing it to the data pools of the corresponding languages makes it possible to supplement the data pool of each language.
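The routing described here can be sketched in a few lines. The record layout (a dict with a `"text"` key), the `identify` callback, and the separate list for unidentified data are assumptions chosen for illustration.

```python
from collections import defaultdict


def route_to_pools(records, identify):
    """Route each record to the pool of its identified language.

    identify is any callable returning a language code, or None when the
    language could not be identified (hypothetical interface).
    """
    pools = defaultdict(list)
    unidentified = []  # candidates for manual checking
    for record in records:
        lang = identify(record["text"])
        if lang is None:
            unidentified.append(record)
        else:
            pools[lang].append(record)
    return dict(pools), unidentified
```

Keeping one list per language means the later matching step only has to compare a record against existing entries of the same language.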

Step S30 can check whether the data items corresponding to the data to be supplemented are stored in the data pool of each language or not through matching and comparing the data to be supplemented with the existing data items. Each data item is stored by taking the identification field as an identification, and the corresponding data item can be retrieved according to the identification field. And traversing the identification field of the existing data item in the data pool according to the identification field of the data to be supplemented in the data pool of each language, and obtaining a matching result of the identification field of the data to be supplemented and the identification field of the existing data item. In some embodiments, when the similarity between the identification field of the data to be supplemented and the identification field of an existing data item is greater than a first similarity threshold, a matching result is obtained in which the identification field of the data to be supplemented is matched with the identification field of the existing data item; when the similarity between the identification field of the data to be supplemented and the identification field of each existing data item is smaller than a second similarity threshold, a matching result that the identification field of the data to be supplemented is not matched with the identification field of each existing data item is obtained; wherein the first similarity threshold is greater than the second similarity threshold. 
The first similarity threshold and the second similarity threshold may be set as required, and typically the first similarity threshold is close to 100%, and when the similarity between the identification field of the data to be supplemented and the identification field of an existing data item is greater than the first similarity threshold, the identification field of the data to be supplemented is considered to be matched with the identification field of the existing data item, and the existing data item may be supplemented by the data to be supplemented, i.e. step S40 is performed to supplement the data to be supplemented into the existing data item. For example, in a data pool, the information content of a piece of data to be supplemented is telephone information of "ABC company", and address information of the merchant name "ABC company" is stored in the data pool, then the obtained telephone information is supplemented into the data item of "ABC company", so as to implement supplement and perfection of the existing data. The second similarity threshold is usually set smaller, for example, less than 30%, and when the identification field of the data to be supplemented does not match the identification field of each existing data entry, the data to be supplemented is considered to be a new data for the data pool of the language, so step S50 is performed to create a data entry according to the identification field of the data to be supplemented, and add the data to be supplemented to the created data entry, so as to realize the data addition. For example, data information of a merchant named "DEF company" in a certain area is acquired, and no data entry of the merchant named "DEF company" is in the corresponding data pool, and then a data entry of the merchant named "DEF company" is added. 
By the data supplementing method, language identification and automatic supplementation of most of data to be supplemented can be realized, manual participation is reduced, and data supplementing efficiency and accuracy are improved. The data resources of the company can be greatly enriched through automatic data supplementation, the existing data is supplemented and perfected, and new data which are not yet available are newly added.
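The two-threshold decision of steps S30 to S50 might look like the sketch below. The patent does not name a similarity measure, so `difflib.SequenceMatcher.ratio()` from the Python standard library is swapped in as an assumption, and the threshold values 0.95 and 0.30 are illustrative choices consistent with "close to 100%" and "less than 30%".

```python
from difflib import SequenceMatcher


def classify_match(candidate_id, existing_ids, high=0.95, low=0.30):
    """Compare one identification field against all existing ones.

    Returns ("match", id)  -> supplement the existing entry (S40),
            ("new", None)  -> create a new entry (S50),
            ("review", id) -> undecided, leave to manual checking.
    """
    best_id, best_score = None, 0.0
    for existing in existing_ids:
        score = SequenceMatcher(None, candidate_id, existing).ratio()
        if score > best_score:
            best_id, best_score = existing, score
    if best_score > high:
        return "match", best_id
    if best_score < low:
        return "new", None
    return "review", best_id
```

An empty pool yields a best score of 0.0, so the first record for any identification field is always classified as new.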

If the matching result is unknown, that is, the maximum of the similarities between the identification field of the data to be supplemented and the identification fields of the existing data entries lies between the second similarity threshold and the first similarity threshold, so that it cannot be determined whether the data to be supplemented matches an existing data entry, the data to be supplemented is pushed to the queue to be checked and decided by manual checking. In the step of identifying the language of each piece of data to be supplemented through the language identification model, data whose language could not be identified can likewise be pushed to the queue to be checked and decided by manual checking. In some embodiments, when the data to be supplemented is supplemented into the existing data entry in step S40, the data to be supplemented is normalized, merged with the existing data entry, and deduplicated. When the data to be supplemented is added to the created data entry in step S50, the data to be supplemented is normalized.
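A minimal sketch of the normalize, merge and deduplicate behavior of step S40 is given below. The entry layout (a dict mapping field names to lists of values) and the normalization rule (collapse whitespace, compare case-insensitively) are assumptions; the patent does not specify either.

```python
def normalize(value):
    """Collapse runs of whitespace and ignore case when comparing values."""
    return " ".join(value.split()).casefold()


def merge_into_entry(entry, incoming):
    """Merge incoming field values into an existing entry without duplicates.

    entry and incoming map a field name to a list of string values.
    """
    for field, values in incoming.items():
        merged = entry.setdefault(field, [])
        seen = {normalize(v) for v in merged}
        for value in values:
            key = normalize(value)
            if key not in seen:
                # Store the whitespace-cleaned form of the new value.
                merged.append(" ".join(value.split()))
                seen.add(key)
    return entry
```

For a newly created entry (step S50), the same function can be called with an empty dict as `entry`, which applies the normalization without any merging.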

Further, in some embodiments, a mapping relationship of mutual indexes is established between data entries in the data pools of the different languages whose identification fields match. After a data entry is created according to the identification field of the data to be supplemented in step S50, the method further includes: judging whether the identification field of the created data entry matches the identification field of a data entry in any of the other data pools, and if so, establishing a mapping relationship of mutual indexes between the created data entry and the data entry with the matching identification field. The data supplementing method described above supplements the data into the corresponding data entries based on the identified language. This embodiment further establishes a mutually indexable mapping between data entries that are in different languages but have matching identification fields, so that from one data entry the data entries in other languages with matching identification fields can be indexed, which makes the page display richer and makes querying and searching easier for users. For example, the Chinese data pool stores a data entry with the merchant name "肯德基" (the Chinese name of KFC), under which detailed information about the merchant is stored, such as the address and telephone number of each branch. During data supplementation, a data entry with the merchant name "Kentucky Fried Chicken" (KFC for short) is newly created in the English data pool, and detailed information about the "KFC" merchant is stored under that data entry.
According to preset matching rules, such as matching rules between merchant names in different languages set for some popular merchants, or operations such as automatic semantic recognition, the newly created data entry with the identification field "KFC" is found to match the data entry with the identification field "肯德基" in the Chinese data pool, and a mapping relationship of mutual indexes is established between the two data entries. Subsequently, a user on the "肯德基" section of the Chinese page can link directly to the "KFC" section of the English page, which enriches the page display and enables flexible searching and querying. When an existing data entry is supplemented in step S40, the mapping relationship of mutual indexes between that entry and the matching data entries in the other data pools has already been established, so it is not established again during data supplementation.
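The mutual index between entries of different languages can be sketched as a symmetric mapping. Here the "preset matching rules" are modeled as a hypothetical `canonical` table that maps each merchant name to a shared key; the pool layout and all names are assumptions for illustration.

```python
def link_new_entry(new_lang, new_id, pools, canonical, links):
    """Link a newly created entry to matching entries in the other pools.

    pools:     {language: set of identification fields}
    canonical: {identification field: shared key} (hypothetical rule table)
    links:     {(language, id): set of (language, id)}, updated in place
    """
    key = canonical.get(new_id)
    if key is None:
        return links  # no rule covers this entry, nothing to link
    for lang, ids in pools.items():
        if lang == new_lang:
            continue
        for other_id in ids:
            if canonical.get(other_id) == key:
                # Record the link in both directions so either entry
                # can be used to reach the other.
                links.setdefault((new_lang, new_id), set()).add((lang, other_id))
                links.setdefault((lang, other_id), set()).add((new_lang, new_id))
    return links


# Illustrative data: "肯德基" is the Chinese brand name of KFC.
pools = {"zh": {"肯德基"}, "en": {"KFC"}}
canonical = {"肯德基": "kfc", "KFC": "kfc"}
links = link_new_entry("en", "KFC", pools, canonical, {})
```

Storing both directions at creation time is what lets later supplementation of an existing entry skip the linking step.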

Fig. 3 shows the architecture flow of the data augmentation method in an embodiment. Referring to fig. 3, in some embodiments the data augmentation method includes the following steps. The first step, data acquisition, corresponds to step S10 in the above embodiment. The acquired data to be supplemented may originate from crawler data and from other data, such as data collected by operators. The second step, language identification, corresponds to step S20 in the above embodiment. The language identification model is trained on text in advance to generate a large number of word segmentation fragments for each language and to count their frequencies; newly input data to be supplemented is compared against these training samples to obtain its language, and is then distributed into the data pool of that language. The third step, preprocessing, includes operations such as self-deduplication and normalization, which remove repeated data and normalize the data to be supplemented. For example, when driver information is acquired in this embodiment, preprocessing leaves four normalized items: name, mobile phone number, identity card number and address. The fourth step, matching comparison, corresponds to step S30 in the above embodiment. In each data pool, the data to be supplemented is matched against the existing data in the pool to obtain a matching result between the identification field of the data to be supplemented and the identification fields of the existing data entries; according to the matching result, the data to be supplemented can be supplemented into the corresponding data entry so as to be displayed online.
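The language identification step can be pictured with a toy frequency-profile classifier. This is only a sketch under stated assumptions: it uses character bigrams as the "fragments", a share-of-score threshold of 0.6, and the hypothetical helper names `ngrams`, `train` and `identify`; the model of the embodiment is trained on far more text and may score words rather than characters.

```python
from collections import Counter


def ngrams(text, n=2):
    """Split text into overlapping character fragments of length n."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=2):
    """samples: language -> list of training texts.
    Returns language -> Counter of fragment frequencies."""
    return {
        lang: Counter(g for t in texts for g in ngrams(t, n))
        for lang, texts in samples.items()
    }


def identify(text, profiles, n=2, threshold=0.6):
    """Return the language whose share of the total score reaches the
    threshold, or None when the language cannot be identified."""
    grams = ngrams(text, n)
    scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Sum of relative frequencies of the input fragments in this language.
        scores[lang] = sum(counts[g] for g in grams) / total if total else 0.0
    norm = sum(scores.values())
    if norm == 0:
        return None  # no known fragment at all: leave for manual checking
    best = max(scores, key=scores.get)
    return best if scores[best] / norm >= threshold else None


profiles = train({
    "en": ["the quick brown fox", "hello world this is english text"],
    "zh": ["你好世界这是中文文本", "满帮集团物流数据"],
})
print(identify("this is english", profiles))  # en
print(identify("这是中文", profiles))          # zh
print(identify("12345", profiles))            # None
```

Records for which `identify` returns `None` would not be shunted into any data pool; in this sketch they are simply left for manual handling.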
Before the data to be supplemented goes online, operators can also spot-check the matching result, for example by sampling the matched data, the unmatched data and the unknown data separately, which improves the reliability of the matching result. Depending on the outcome of the spot check, the matching rules can be adjusted and matching performed again. In the last step, the data goes online, that is, the data to be supplemented is added to the corresponding data entry in the corresponding data pool and displayed. Data to be supplemented with a different field, that is, with an identification field that differs from those of the existing entries, goes online as a newly added entry, corresponding to step S50 in the above embodiment. Data to be supplemented with the same field, that is, with the same identification field as an existing entry, goes online as a supplement to that entry, corresponding to step S40 in the above embodiment. For conflicting data under the same identification field, such as different telephone numbers under the same driver name, whether to supplement it online can be decided after manual review.
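The final "going online" step, including the manual-review path for conflicting data, might look roughly like the following sketch. It assumes exact equality of the identification field as the matching rule and holds back the whole record when any field conflicts; the function name `put_online` and the record layout are illustrative assumptions, not part of the embodiment.

```python
def put_online(pool, record, review_queue, id_key="name"):
    """pool maps an identification field value to its existing data entry (a dict).

    Returns "added" (step S50), "supplemented" (step S40) or "review".
    """
    ident = record[id_key]
    entry = pool.get(ident)
    if entry is None:
        # No existing entry with this identification field: create one.
        pool[ident] = dict(record)
        return "added"

    # Fields present in both the entry and the record but with different values.
    conflicts = {
        key: (entry[key], value)
        for key, value in record.items()
        if key in entry and entry[key] != value
    }
    if conflicts:
        # For example, a different phone number under the same driver name.
        review_queue.append((record, conflicts))
        return "review"

    # Same identification field and no conflict: fill in the missing fields.
    entry.update(record)
    return "supplemented"


pool = {"Zhang San": {"name": "Zhang San", "phone": "13800000000"}}
queue = []
print(put_online(pool, {"name": "Zhang San", "address": "Nanjing"}, queue))   # supplemented
print(put_online(pool, {"name": "Zhang San", "phone": "13900000000"}, queue))  # review
print(put_online(pool, {"name": "Li Si", "phone": "13700000000"}, queue))      # added
```

Keeping the conflicting record out of the pool until an operator decides mirrors the manual review described above; a real system could instead supplement the non-conflicting fields immediately.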

In summary, the data supplementing method of the invention quickly and accurately identifies the language of the data to be supplemented through the language identification model, so that the data is shunted into the data pool of the corresponding language; by matching and comparing the data to be supplemented with the existing data, matched data is supplemented into the existing data entries, which completes and improves the existing data; for data to be supplemented that matches no existing data, a new data entry is created and the data is newly added; automatic supplementation of most data is thereby achieved, manual participation is reduced, and the efficiency and accuracy of data supplementation are improved; and a mapping relationship is established between data entries with matching identification fields across the languages, which enriches page display and facilitates query and indexing of the associated data entries.

The embodiment of the invention also provides a data supplementing system. Fig. 4 shows the main modules of the data supplementing system of an embodiment. Referring to fig. 4, in some embodiments the data supplementing system 4 mainly comprises: a data acquisition module 401, configured to acquire, according to the identification field, a plurality of pieces of data to be supplemented associated with the identification field; a language identification module 402, configured to identify the language of each piece of data to be supplemented through the language identification model, and to shunt the data to be supplemented of the identified language into the data pool of the corresponding language; a matching comparison module 403, configured to traverse, in the data pool of each language, the identification fields of the existing data entries in the data pool according to the identification field of the data to be supplemented, and obtain a matching result; a data supplementing module 404, configured to supplement the data to be supplemented into an existing data entry when the identification field of the data to be supplemented matches the identification field of that existing data entry; and a data adding module 405, configured to create a data entry according to the identification field of the data to be supplemented and add the data to be supplemented to the created data entry when the identification field of the data to be supplemented matches the identification field of none of the existing data entries.
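As a rough picture of how modules 401 to 405 could cooperate, the sketch below wires them together in one Python class. The class name `DataSupplementSystem`, the injected callables `identify_language` and `is_match`, and the record layout (a dict that may carry the identification field as a key) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Optional


class DataSupplementSystem:
    def __init__(self,
                 identify_language: Callable[[str], Optional[str]],
                 is_match: Callable[[str, str], bool]):
        self.identify_language = identify_language
        self.is_match = is_match
        # language -> identification field value -> list of supplemented records
        self.pools = defaultdict(dict)
        self.unidentified = []

    def acquire(self, source, id_key):
        """Module 401: keep the records that carry the identification field."""
        return [r for r in source if id_key in r]

    def route(self, records, id_key):
        """Module 402: shunt each record to the pool of its identified language."""
        routed = defaultdict(list)
        for record in records:
            lang = self.identify_language(record[id_key])
            if lang is None:
                self.unidentified.append(record)
            else:
                routed[lang].append(record)
        return routed

    def find_match(self, lang, ident):
        """Module 403: traverse the existing entries of one language pool."""
        for existing in self.pools[lang]:
            if self.is_match(ident, existing):
                return existing
        return None

    def run(self, source, id_key):
        for lang, records in self.route(self.acquire(source, id_key), id_key).items():
            for record in records:
                existing = self.find_match(lang, record[id_key])
                if existing is not None:
                    self.pools[lang][existing].append(record)     # module 404
                else:
                    self.pools[lang][record[id_key]] = [record]   # module 405


def toy_language(text):
    # Toy rule: any CJK unified ideograph means Chinese, otherwise English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


system = DataSupplementSystem(toy_language, lambda a, b: a.casefold() == b.casefold())
system.run(
    [{"merchant": "KFC", "phone": "1"}, {"merchant": "kfc", "address": "x"},
     {"merchant": "肯德基", "phone": "2"}, {"driver": "Li"}],
    "merchant",
)
print({lang: list(pool) for lang, pool in system.pools.items()})
```

Passing the language identifier and the matching rule in as callables keeps the module boundaries visible and lets either one be replaced, for example by a trained model or a similarity threshold, without touching the rest of the flow.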

Each module of the data supplementing system 4 can be used to execute steps S10 to S50 described in any of the data supplementing method embodiments. The system quickly and accurately identifies the language of the data to be supplemented through the language identification model, so that the data is shunted into the data pool of the corresponding language; by matching and comparing the data to be supplemented with the existing data, matched data is supplemented into the existing data entries, which completes and improves the existing data; for data to be supplemented that matches no existing data, a new data entry is created and the data is newly added; automatic supplementation of most data is thereby achieved, manual participation is reduced, and the efficiency and accuracy of data supplementation are improved; and a mapping relationship is established between data entries with matching identification fields across the languages, which enriches page display and facilitates query and indexing of the associated data entries.

The embodiment of the invention also provides a data augmentation device comprising a processor and a memory, the memory having stored therein executable instructions, the processor being configured to perform the steps of the data augmentation method of the above-described embodiment via execution of the executable instructions.

As described above, the data augmentation device of the present invention quickly and accurately identifies the language of the data to be supplemented through the language identification model, so that the data is shunted into the data pool of the corresponding language; by matching and comparing the data to be supplemented with the existing data, matched data is supplemented into the existing data entries, which completes and improves the existing data; for data to be supplemented that matches no existing data, a new data entry is created and the data is newly added; automatic supplementation of most data is thereby achieved, manual participation is reduced, and the efficiency and accuracy of data supplementation are improved; and a mapping relationship is established between data entries with matching identification fields across the languages, which enriches page display and facilitates query and indexing of the associated data entries.

Fig. 5 is a schematic structural diagram of the data supplementing device in the embodiment of the present invention, and it should be understood that fig. 5 only schematically illustrates each module, and these modules may be virtual software modules or actual hardware modules, and the combination, splitting and addition of the remaining modules are all within the scope of the present invention.

Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, any of which may be referred to herein as a "circuit," "module," or "platform."

A data augmentation device (hereinafter referred to as an electronic device) 500 of the present invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting the different platform components (including the storage unit 520 and the processing unit 510), a display unit 540, etc.

The storage unit 520 stores program code executable by the processing unit 510, such that the processing unit 510 performs the steps of the data augmentation method described in the above embodiments. For example, the processing unit 510 may perform the steps shown in figs. 1 to 3.

The storage unit 520 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 5201 and/or cache memory unit 5202, and may further include Read Only Memory (ROM) 5203.

The storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

Bus 530 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 500 may also communicate with one or more external devices 600 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 500, and/or any device (e.g., router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 550. Also, electronic device 500 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 560. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 500, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.

The embodiment of the present invention also provides a computer-readable storage medium storing a program which, when executed, implements the steps of the data augmentation method described in the above embodiment. In some possible implementations, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of the data augmentation method described in the above embodiments, when the program product is run on the terminal device.

As described above, the computer-readable storage medium of the present invention quickly and accurately identifies the language of the data to be supplemented through the language identification model, so that the data is shunted into the data pool of the corresponding language; by matching and comparing the data to be supplemented with the existing data, matched data is supplemented into the existing data entries, which completes and improves the existing data; for data to be supplemented that matches no existing data, a new data entry is created and the data is newly added; automatic supplementation of most data is thereby achieved, manual participation is reduced, and the efficiency and accuracy of data supplementation are improved; and a mapping relationship is established between data entries with matching identification fields across the languages, which enriches page display and facilitates query and indexing of the associated data entries.

Fig. 6 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 6, a program product 700 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A data augmentation method comprising the steps of:

acquiring a plurality of pieces of data to be supplemented which are associated with the identification field according to the identification field;

identifying the languages of each piece of data to be supplemented through the language identification model, and shunting the data to be supplemented of the identified languages to a data pool of the corresponding languages;

traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be supplemented in the data pool of each language to obtain a matching result;

when the identification field of data to be supplemented is matched with the identification field of an existing data item, supplementing the data to be supplemented into the existing data item; and

when the identification field of the data to be supplemented is not matched with the identification field of each existing data item, creating a data item according to the identification field of the data to be supplemented, and adding the data to be supplemented into the created data item;

wherein, in the data pool of each language, the mapping relation of mutual index is established between the data items with the matched identification fields, and after creating a data item according to the identification field of the data to be supplemented, the method further comprises the steps of:

judging whether the identification field of the created data item is matched with the identification fields of the data items in the rest data pools, if so, establishing a mapping relation of mutual indexes of the created data item and the data item matched with the identification field, so that according to one data item, the other data items of different languages of which the identification fields are matched with the data item are indexed.

2. The data augmentation method of claim 1, wherein the language identification model is constructed based on a language model N-Gram, and the step of identifying the language of each piece of data to be augmented by the language identification model comprises:

word segmentation is carried out on data to be supplemented, and a vocabulary set of the data to be supplemented is obtained;

inputting each vocabulary in the vocabulary set into the language identification model to obtain the probability that each vocabulary belongs to each language;

according to the probability that each vocabulary belongs to each language, the probability that the data to be supplemented belongs to each language is obtained; and

determining the languages of the data to be supplemented according to the relation between the probability that the data to be supplemented belongs to each language and a preset probability threshold.

3. The data augmentation method of claim 1, wherein in the step of traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be augmented, when the similarity between the identification field of the data to be augmented and the identification field of an existing data item is greater than a first similarity threshold, a matching result of the identification field of the data to be augmented and the identification field of the existing data item is obtained; and

when the similarity between the identification field of the data to be supplemented and the identification field of each existing data item is smaller than a second similarity threshold, a matching result that the identification field of the data to be supplemented is not matched with the identification field of each existing data item is obtained;

the first similarity threshold is greater than the second similarity threshold.

4. A data augmentation method as claimed in claim 3, further comprising the step of:

when the maximum similarity is between the second similarity threshold and the first similarity threshold in the similarity between the identification field of the data to be supplemented and the identification field of each existing data item, pushing the data to be supplemented to a queue to be checked; and

in the step of identifying the languages of each piece of data to be supplemented through the language identification model, pushing the data to be supplemented, of which the languages are not identified, to the queue to be checked.

5. A data augmentation method as claimed in claim 1, wherein said data to be augmented comprises logistics data, and said identification field comprises any one or more of the following: merchant, driver, vehicle, address; and

the data to be augmented associated with the identification field includes data to be augmented that exactly matches the identification field and data to be augmented that implicitly matches the identification field.

6. The data augmentation method of claim 1, wherein in the step of supplementing the data to be augmented into the existing data entry, the data to be augmented is normalized, and the data to be augmented is merged with the existing data entry for deduplication; and

and normalizing the data to be supplemented in the step of adding the data to be supplemented to the created data entry.

7. A data augmentation system comprising:

the data acquisition module is used for acquiring a plurality of pieces of data to be supplemented, which are associated with the identification field, according to the identification field;

the language identification module is used for identifying the languages of each piece of data to be supplemented through the language identification model and shunting the data to be supplemented of the identified languages to a data pool of the corresponding languages;

the matching comparison module is used for traversing the identification fields of the existing data items in the data pool according to the identification fields of the data to be supplemented in the data pool of each language to obtain a matching result;

the data supplementing module is used for supplementing the data to be supplemented into the existing data item when the identification field of the data to be supplemented is matched with the identification field of the existing data item; and

the data adding module is used for creating a data item according to the identification field of the data to be supplemented and adding the data to be supplemented into the created data item when the identification field of the data to be supplemented is not matched with the identification field of each existing data item;

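wherein, in the data pool of each language, the mapping relation of mutual indexes is established between the data items with the matched identification fields, and the data adding module further executes, after creating a data item according to the identification field of the data to be supplemented:

judging whether the identification field of the created data item is matched with the identification fields of the data items in the rest data pools, if so, establishing a mapping relation of mutual indexes of the created data item and the data item matched with the identification field, so that according to one data item, the other data items of different languages of which the identification fields are matched with the data item are indexed.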

8. A data augmentation device comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the data augmentation method of any one of claims 1 to 6 via execution of the executable instructions.

9. A computer readable storage medium storing a program, characterized in that the program when executed implements the steps of the data augmentation method of any one of claims 1 to 6.