CN113961696A

CN113961696A - Automatic oracle bone conjugation verification method based on ObiBert

Info

Publication number: CN113961696A
Application number: CN202111273361.XA
Authority: CN
Inventors: 熊晶; 翟雪; 陈利平; 刘国英; 刘永革; 韩胜伟; 王楠; 张展
Original assignee: Anyang Normal University
Current assignee: Anyang Normal University
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-01-21
Anticipated expiration: 2041-10-29

Abstract

The invention discloses an automatic oracle bone conjugation verification method based on ObiBert, comprising the following steps: S1, collecting a large number of oracle bone explanation texts and building an oracle bone BERT corpus with the direct participation of oracle bone experts; S2, vectorizing the oracle bone explanation texts in the oracle bone BERT corpus into a summed vector, specifically the sum of the token embedding, text embedding and position embedding, to obtain the ObiBert neural network model; S3, passing the oracle bone script text on the conjugated oracle bone pieces through the ObiBert NSP model to judge whether the automatic conjugation result for the oracle bone fragments is correct. By judging automatic conjugation results with ObiBert, the invention combines the oracle bone explanation text to screen the highest-probability option from the candidate results of computer-based automatic conjugation; that is, it provides a method for judging whether an automatic conjugation result for oracle bone fragments is correct, further advancing the application of oracle bone studies.

Description

Automatic oracle bone conjugation verification method based on ObiBert

Technical Field

The invention belongs to the technical field of oracle bone studies, and particularly relates to an automatic oracle bone conjugation verification method based on ObiBert.

Background

Oracle bone inscriptions are a treasure of the Chinese nation and have important historical value and research significance. However, owing to the nature of the material and its history, oracle bones often survive only as fragments; correctly piecing these fragments back together is called oracle bone conjugation. In practice, the objects of study are images of oracle bones, such as photographs and rubbings, rather than the physical bones. Traditional conjugation research is carried out by oracle bone experts through collecting oracle bone images, copying, cutting, splicing, proofreading and similar steps, and only experts with very deep research accumulation and conjugation experience can do it. This has greatly hindered the progress of modern oracle bone studies. Since computer technology was introduced into the field, conjugation research has advanced considerably, because image processing techniques make edge- and contour-based automatic conjugation of oracle bone fragments possible. A new problem has emerged, however: the edges and contours of the fragments do not fit together exactly, and because of wear on the material and the existence of tiny fragments, computer-based automatic conjugation of oracle bone fragments (hereinafter referred to as automatic conjugation) produces a large number of candidate results. Image processing alone is therefore clearly insufficient for oracle bone fragment conjugation research.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides an automatic oracle bone conjugation verification method based on ObiBert. By combining the oracle bone explanation text, the method selects the highest-probability option from the candidate results of computer-based automatic conjugation; that is, it provides a method for judging whether an automatic conjugation result for oracle bone fragments is correct.

In order to solve the technical problems, the invention provides the following technical scheme:

The invention provides an automatic oracle bone conjugation verification method based on ObiBert, which comprises the following steps:

S1, collecting a large number of oracle bone explanation texts and constructing an oracle bone BERT corpus;

S2, vectorizing the oracle bone explanation texts in the oracle bone BERT corpus into a summed vector to obtain the ObiBert neural network model, the summed vector specifically being the sum of the token embedding, text embedding and position embedding;

S3, judging, by means of the ObiBert NSP model and the oracle bone explanation text on the conjugated oracle bone pieces, whether the automatic conjugation result for the oracle bone fragments is correct. The judging method is as follows: the explanation texts on any two automatically conjugated oracle bone pieces are extracted and concatenated, giving two sentences as input; a marker symbol is added to the input of the NSP model and its corresponding output is used as the semantic representation of the explanation text; the two input sentences are separated by a separator symbol, and two different text vectors are added to the two sentences respectively to distinguish them. If the model judges the pair correct, the conjugation of the two oracle bone pieces is correct; if the model judges the pair wrong, the conjugation of the two oracle bone pieces is wrong.
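
The decision in S3, together with the screening of the highest-probability candidate described above, can be sketched as follows. This is only an illustration: `score_fn` stands in for a call to a trained NSP model and is hypothetical, and the 0.5 threshold is an assumption, since the patent does not state how the model output is turned into a correct/wrong verdict.

```python
def judge_conjugation(prob_is_next, threshold=0.5):
    """Return True if the NSP probability supports the conjugation.

    The threshold is an assumed value, not taken from the patent.
    """
    return prob_is_next >= threshold


def best_candidate(candidates, score_fn):
    """Pick the candidate pair with the highest NSP probability.

    candidates: list of (text_a, text_b) explanation-text pairs, one per
                candidate conjugation produced by image-based matching.
    score_fn:   hypothetical callable returning P(text_b follows text_a).
    """
    if not candidates:
        raise ValueError("no candidates to rank")
    return max(candidates, key=lambda pair: score_fn(pair[0], pair[1]))
```

In use, `score_fn` would wrap whatever inference call the trained model exposes; the ranking logic itself does not depend on the model.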

As a preferred technical solution of the present invention, step S1 specifically includes the following steps:

S11, separating the obtained oracle bone text character by character, that is, treating each oracle bone character as one word, and removing the punctuation marks from the text, consistent with the fact that original oracle bone texts carry no punctuation marks;

S12, constructing a dictionary: counting the frequency of each oracle bone character, representing each character as an integer id according to its frequency, and recording the mapping between characters and ids;

S13, representing each oracle bone explanation text as an id sequence in its original word order;

S14, training on the oracle bone explanation text corpus with the CBOW neural network model of word2vec: scanning the corpus with a sliding window of size 3 and predicting the center word of each window from its context, thereby forming the training data;

S15, obtaining a parameter matrix after training, in which each row is the word vector of the corresponding oracle bone character in the dictionary and the number of rows equals the size of the dictionary.
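
Steps S11 to S14 can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the patented implementation: the punctuation set, the tie-breaking rule for equal frequencies, and all function names are hypothetical, and the CBOW training that yields the S15 parameter matrix is not shown.

```python
from collections import Counter
import string

# Assumed punctuation set; the patent does not list the marks it removes.
PUNCTUATION = set(string.punctuation) | set("，。、；：？！")


def split_characters(text):
    """S11: one character per token, with punctuation and whitespace removed."""
    return [ch for ch in text if ch not in PUNCTUATION and not ch.isspace()]


def build_vocab(token_lists):
    """S12: map each character to an integer id, most frequent first.

    Ties are broken by character order, which is an arbitrary choice here.
    """
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


def to_ids(tokens, vocab):
    """S13: represent a text as an id sequence in its original order."""
    return [vocab[tok] for tok in tokens]


def cbow_pairs(ids, window=3):
    """S14: slide a window over the ids; the context predicts the center id."""
    half = window // 2
    pairs = []
    for center in range(half, len(ids) - half):
        context = ids[center - half:center] + ids[center + 1:center + half + 1]
        pairs.append((context, ids[center]))
    return pairs
```

With a window of size 3, each training example has exactly two context ids (the left and right neighbours) and one target id.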

As a preferable technical scheme of the invention, the method also comprises the following steps:

S4, if the two conjugated pieces are judged correct in step S3, treating them as a whole, combining this whole with an adjacent oracle bone piece, and repeating step S3 until all pieces in the automatic conjugation result are judged correct, or retaining the largest set of correctly conjugated pieces as the final conjugation result;

S5, if the two conjugated pieces are judged wrong in step S3, keeping either one of them, selecting another adjacent oracle bone piece to combine with it, and repeating steps S3 and S4 until all pieces in the automatic conjugation result are judged correct, or retaining the combination with the largest number of correctly conjugated pieces as the final conjugation result.
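
One possible reading of S4 and S5 is a greedy loop that grows a verified group piece by piece and keeps the largest group found. The sketch below follows that reading only; the patent does not fix the traversal order, so the adjacency-ordered list, the way merged texts are concatenated, and the `judge` callable (a stand-in for the S3 check) are all assumptions.

```python
def largest_verified_group(pieces, judge):
    """Return the ids of the largest group of pieces that pass verification.

    pieces: list of (piece_id, explanation_text) in assumed adjacency order.
    judge:  hypothetical callable judge(text_a, text_b) -> bool, standing in
            for the S3 check with the NSP model.
    """
    best = []
    for start in range(len(pieces)):
        kept = [pieces[start][0]]
        merged_text = pieces[start][1]
        for piece_id, text in pieces[start + 1:]:
            if judge(merged_text, text):
                # S4: accept the piece and treat the merged group as a whole.
                kept.append(piece_id)
                merged_text += text
            # S5: otherwise keep the current group and try the next piece.
        if len(kept) > len(best):
            best = kept
    return best
```

Trying every starting piece covers the case where the first piece of a candidate is itself the wrongly matched one, at the cost of a quadratic number of model calls.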

As a preferred technical scheme of the invention, token embedding builds the word vectors of the oracle bone characters: each oracle bone character in an oracle bone explanation sentence is taken as a segmentation unit (token), and each token is then converted into a vector representation of fixed dimension. The [CLS] symbol marks the start of the token sequence and the [SEP] symbol marks its end. In view of the particular nature of oracle bone characters, [C] represents a character that is incomplete or too blurred to be recognized, and [U_n] (where n = 1, 2, 3, ...) represents an oracle bone character that has not yet been recognized.
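
A minimal tokenizer for this scheme might look like the following. It assumes, purely for illustration, that damaged and unrecognized characters are already written in the explanation text as the literal strings `[C]` and `[U_n]`; the patent does not describe the input encoding, and the function name is hypothetical.

```python
import re

# Match the special placeholders first, then any single non-space character.
TOKEN_PATTERN = re.compile(r"\[C\]|\[U_\d+\]|\S")


def tokenize(sentence):
    """Split a sentence into per-character tokens framed by [CLS] and [SEP]."""
    return ["[CLS]"] + TOKEN_PATTERN.findall(sentence) + ["[SEP]"]
```

Because the placeholder alternatives come before `\S` in the pattern, `[C]` and `[U_2]` are kept as single tokens instead of being split into brackets and letters.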

As a preferred technical scheme of the invention, text embedding is an operation on pairs of oracle bone explanation sentences. It is implemented as follows: indices 0 and 1 are used to form vectors representing the different sentences, that is, every token of the first sentence is assigned 0, forming the first vector, and every token of the second sentence is assigned 1, forming the second vector. If there is only one input sentence, its text embedding is a vector whose indices are all 0.
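
The index assignment described here is simple enough to state directly in code. The function name is hypothetical, and whether the marker and separator symbols count as part of the first or second sentence is left to the caller, since the patent does not say.

```python
def segment_ids(tokens_a, tokens_b=None):
    """Assign 0 to every token of the first sentence and 1 to the second.

    With a single input sentence, every index is 0.
    """
    ids = [0] * len(tokens_a)
    if tokens_b is not None:
        ids += [1] * len(tokens_b)
    return ids
```

These indices are then used to look up one of two learned vectors per token, which is what lets the model tell the two sentences apart.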

As a preferred technical scheme of the invention, position embedding learns a vector representation for each position in the oracle bone explanation sentence in order to capture text order information, so the same oracle bone character appearing at different positions is represented by different vectors. It is implemented as follows: a lookup table of suitable size is designed in which the first row is the vector representation of any oracle bone character in the first position, the second row is the vector representation of any oracle bone character in the second position, and so on.
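
The summed input vector of step S2 can be illustrated with three lookup tables and an element-wise sum. This is a sketch under stated assumptions: the tables are plain nested lists supplied by the caller (in a trained model they would be learned parameters), and the function name and error handling are hypothetical.

```python
def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, text (segment) and position embeddings for each position.

    Each table is a list of rows; all rows share the same dimension.
    Row i of position_table is the vector for position i, whatever the token.
    """
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have equal length")
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the position table")
    summed = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], segment_table[seg], position_table[pos])
        summed.append([t + s + p for t, s, p in rows])
    return summed
```

Because the position row differs at every index, the same token id yields a different summed vector at each position, which is the behaviour the paragraph above describes.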

As a preferred technical scheme of the invention, NSP is Next Sentence Prediction, whose task is to predict whether sentence B is the next sentence after sentence A; the purpose of NSP is to capture information between sentences.
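
Training data for an NSP task is commonly built by pairing each sentence with its true successor about half of the time and with a sentence from another document otherwise. The sketch below follows that common recipe from the original BERT setup; the patent does not describe how its training pairs are sampled, so the 50/50 split and all names are assumptions.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples; label 1 means B follows A.

    documents: list of documents, each a list of sentences in order.
    rng:       optional random.Random instance for reproducible sampling.
    """
    rng = rng or random.Random(0)
    examples = []
    for doc_index, sentences in enumerate(documents):
        # Non-empty documents other than the current one supply negatives.
        others = [d for j, d in enumerate(documents) if j != doc_index and d]
        for i in range(len(sentences) - 1):
            if others and rng.random() < 0.5:
                negative = rng.choice(rng.choice(others))
                examples.append((sentences[i], negative, 0))
            else:
                examples.append((sentences[i], sentences[i + 1], 1))
    return examples
```

For conjugation verification, the explanation texts of two pieces that genuinely belong together play the role of a positive pair.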

Compared with the prior art, the invention has the following beneficial effects:

The method uses the oracle bone script corpus to judge whether the automatic conjugation result for oracle bone fragments is correct, combining the oracle bone explanation text to screen the highest-probability option from the candidate results of computer-based automatic conjugation; that is, it provides a method for judging whether an automatic conjugation result for oracle bone fragments is correct, further advancing the application of oracle bone studies.

Drawings

FIG. 1 is a working diagram of the ObiBert-based automatic oracle bone conjugation verification method of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1

In order to achieve the object of the present invention, as shown in FIG. 1, one embodiment of the present invention provides an ObiBert-based automatic oracle bone conjugation verification method, including the following steps:

S1, collecting a large number of oracle bone explanation texts and constructing the oracle bone BERT corpus. The method specifically comprises the following steps:

S2, vectorizing the oracle bone explanation texts in the oracle bone BERT corpus into a summed vector to obtain the ObiBert neural network model, the summed vector specifically being the sum of the token embedding, text embedding and position embedding.

Specifically, token embedding builds the word vectors of the oracle bone characters: each oracle bone character in an oracle bone explanation sentence is taken as a segmentation unit (token), and each token is then converted into a vector representation of fixed dimension. The [CLS] symbol marks the start of the token sequence and the [SEP] symbol marks its end. In view of the particular nature of oracle bone characters, [C] represents a character that is incomplete or too blurred to be recognized, and [U_n] (where n = 1, 2, 3, ...) represents an oracle bone character that has not yet been recognized.

Specifically, text embedding is an operation on pairs of oracle bone explanation sentences. It is implemented as follows: indices 0 and 1 are used to form vectors representing the different sentences, that is, every token of the first sentence is assigned 0, forming the first vector, and every token of the second sentence is assigned 1, forming the second vector. If there is only one input sentence, its text embedding is a vector whose indices are all 0.

Specifically, position embedding learns a vector representation for each position in the oracle bone explanation sentence in order to capture text order information, so the same oracle bone character appearing at different positions is represented by different vectors. It is implemented as follows: a lookup table of suitable size is designed in which the first row is the vector representation of any oracle bone character in the first position, the second row is the vector representation of any oracle bone character in the second position, and so on.

NSP is Next Sentence Prediction, whose task is to predict whether sentence B is the next sentence after sentence A; the purpose of NSP is to capture information between sentences.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An automatic oracle bone conjugation verification method based on ObiBert, characterized by comprising the following steps:

S3, judging, by means of the ObiBert NSP model and the oracle bone explanation text on the conjugated oracle bone pieces, whether the automatic conjugation result for the oracle bone fragments is correct; the judging method is as follows: the explanation texts on any two automatically conjugated oracle bone pieces are extracted and concatenated, giving two sentences as input; the NSP model adds a marker symbol and uses its corresponding output as the semantic representation of the explanation text; the two input sentences are separated by a separator symbol, and two different text vectors are added to the two sentences respectively to distinguish them; if the model judges the pair correct, the conjugation of the two oracle bone pieces is correct; if the model judges the pair wrong, the conjugation of the two oracle bone pieces is wrong.

2. The ObiBert-based oracle automatic conjugation verification method according to claim 1, wherein step S1 specifically comprises the following steps:

3. The ObiBert-based oracle automatic conjugation verification method according to claim 1, further comprising the steps of:

4. The ObiBert-based oracle automatic conjugation verification method according to claim 1, further comprising the steps of:

5. The ObiBert-based oracle automatic conjugation verification method according to claim 1, wherein token embedding builds the word vectors of the oracle bone characters, that is, each oracle bone character in an oracle bone explanation sentence is taken as a segmentation unit (token), and each token is then converted into a vector representation of fixed dimension; the [CLS] symbol marks the start of the token sequence; the [SEP] symbol marks its end; in view of the particular nature of oracle bone characters, [C] represents a character that is incomplete or too blurred to be recognized; and [U_n] (where n = 1, 2, 3, ...) represents an oracle bone character that has not yet been recognized.

6. The ObiBert-based oracle automatic conjugation verification method according to claim 1, wherein text embedding is an operation on pairs of oracle bone explanation sentences, implemented as follows: indices 0 and 1 are used to form vectors representing the different sentences, that is, every token of the first sentence is assigned 0, forming the first vector; every token of the second sentence is assigned 1, forming the second vector; if there is only one input sentence, its text embedding is a vector whose indices are all 0.

7. The ObiBert-based oracle automatic conjugation verification method according to claim 1, wherein position embedding learns a vector representation for each position in the oracle bone explanation sentence in order to capture text order information; the same oracle bone character appearing at different positions is represented by different vectors; it is implemented as follows: a lookup table of suitable size is designed in which the first row is the vector representation of any oracle bone character in the first position, the second row is the vector representation of any oracle bone character in the second position, and so on.

8. The ObiBert-based oracle automatic conjugation verification method according to claim 1, wherein NSP is Next Sentence Prediction, whose task is to predict whether sentence B is the next sentence after sentence A; the purpose of NSP is to capture information between sentences.