CN117973378A - Term collocation extraction method and device based on BERT model

Term collocation extraction method and device based on BERT model

Info

Publication number
CN117973378A
CN117973378A
Authority
CN
China
Prior art keywords
word
matrix
bert model
matrixes
attention
Prior art date
Legal status
Granted
Application number
CN202410019481.4A
Other languages
Chinese (zh)
Other versions
CN117973378B (en)
Inventor
王淼
徐娟
殷晓君
Current Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202410019481.4A
Publication of CN117973378A
Application granted
Publication of CN117973378B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention relates to the technical field of text processing, and in particular to a method and device for extracting word collocations based on a BERT model. The method comprises the following steps: acquiring training samples, and training an initial BERT model on the training samples to obtain a trained BERT model; acquiring an input sentence from which information is to be extracted, and feeding the input sentence into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model; determining, within the attention matrix, a plurality of word-forming matrices that satisfy preset conditions; and calculating collocation degree scores between the word-forming matrices according to the attention matrix, and determining the collocation relations between the word-forming matrices according to the calculated scores. With this method, words are first extracted from the correlations between characters, the collocation degree between the words is then determined, and the collocation relations are extracted accordingly, which saves labor and time costs; an existing BERT model is enough to determine word collocation with high efficiency and high quality, improving the efficiency of collocation extraction.

Description

Term collocation extraction method and device based on BERT model
Technical Field
The invention relates to the technical field of text processing, and in particular to a method and device for extracting word collocations based on a BERT model.
Background
Automatic extraction of word collocations is one of the basic problems of text understanding and plays an important role in language teaching, search engines, and recommendation systems. For example, in the sentence "提高下次成绩" ("improve next performance"), "提高" ("improve") and "成绩" ("performance") form a collocation.
At present, the main methods for automatic extraction of word collocations are the following three:
1. Statistics of word co-occurrence after word segmentation.
2. Construction of a semantic knowledge base, with collocation relations determined through feature matching.
3. Extraction using a syntactic-semantic dependency model.
However, these three conventional methods still have the following disadvantages:
1. Methods based on word co-occurrence cannot extract collocations from a single sentence.
2. Constructing a knowledge base or using a syntactic dependency model requires a large amount of expert labeling work to train the model, which is very costly, while training the model on fewer samples reduces its accuracy.
Disclosure of Invention
In order to solve the problems of high cost and low accuracy in the prior art, embodiments of the invention provide a method and device for extracting word collocations based on a BERT model. The technical scheme is as follows:
In one aspect, a method for extracting word collocations based on a BERT model is provided. The method is implemented by a BERT-model-based word collocation extraction device and comprises:
S1, acquiring training samples, and training an initial BERT model on the training samples to obtain a trained BERT model;
S2, acquiring an input sentence from which information is to be extracted, and feeding the input sentence into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model;
S3, determining, within the attention matrix, a plurality of word-forming matrices that satisfy preset conditions;
and S4, calculating collocation degree scores between the word-forming matrices according to the attention matrix, and determining the collocation relations between the word-forming matrices according to the calculated scores.
Optionally, acquiring training samples and training the initial BERT model on them in step S1 comprises:
collecting a corpus composed of articles;
segmenting the articles in the corpus into sentences to obtain a plurality of training sample sentences;
and training the initial BERT model on the training sample sentences to obtain the trained BERT model.
Optionally, the attention matrix is a matrix of dimension N×N, where N is the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
Optionally, determining, within the attention matrix, the plurality of word-forming matrices satisfying the preset conditions in S3 comprises:
S31, traversing the attention matrix according to a set maximum word-forming matrix size, and determining candidate word-forming matrices whose length and width are both smaller than or equal to that size;
S32, for any candidate word-forming matrix, if the sum of its elements is greater than or equal to a preset first threshold and the sum of its elements excluding those on the diagonal is greater than or equal to a second threshold, determining the candidate as a word-forming matrix satisfying the preset conditions.
Optionally, calculating the collocation degree scores between the word-forming matrices according to the attention matrix and determining the collocation relations between the word-forming matrices in S4 comprises:
S41, selecting any two word-forming matrices from the plurality of word-forming matrices, denoted A and B respectively, wherein the upper-left element of word-forming matrix A is (a0, a0), the lower-right element of A is (as, as), the upper-left element of word-forming matrix B is (b0, b0), and the lower-right element of B is (bs, bs); setting a=a0, b=b0, z=0;
S42, calculating z = att_matrix[a][b] + z;
S43, judging whether a is smaller than as; if yes, setting a=a+1 and returning to S42; if not, executing S44;
S44, judging whether b is smaller than bs; if yes, setting a=a0 and b=b+1 and returning to S42; if not, executing S45;
S45, setting a=a0, b=b0, f=0;
S46, calculating f = att_matrix[b][a] + f;
S47, judging whether a is smaller than as; if yes, setting a=a+1 and returning to S46; if not, executing S48;
S48, judging whether b is smaller than bs; if yes, setting a=a0 and b=b+1 and returning to S46; if not, executing S49;
S49, calculating the average value z' of z and the average value f' of f, and taking the larger of z' and f' as the collocation degree score of word-forming matrices A and B; if the score is greater than or equal to a preset threshold, determining that A and B satisfy the collocation relation; if it is smaller than the preset threshold, determining that A and B do not satisfy the collocation relation.
In another aspect, a word collocation extraction apparatus based on a BERT model is provided. The apparatus is applied to the BERT-model-based word collocation extraction method and comprises:
a training module, configured to acquire training samples and train an initial BERT model on them to obtain a trained BERT model;
an acquisition module, configured to acquire an input sentence from which information is to be extracted and feed it into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model;
a determining module, configured to determine, within the attention matrix, a plurality of word-forming matrices satisfying preset conditions;
and a calculation module, configured to calculate the collocation degree scores between the word-forming matrices according to the attention matrix and determine the collocation relations between the word-forming matrices according to the calculated scores.
Optionally, the training module is configured to:
collect a corpus composed of articles;
segment the articles in the corpus into sentences to obtain a plurality of training sample sentences;
and train the initial BERT model on the training sample sentences to obtain the trained BERT model.
Optionally, the attention matrix is a matrix of dimension N×N, where N is the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
Optionally, the determining module is configured to:
S31, traverse the attention matrix according to a set maximum word-forming matrix size, and determine candidate word-forming matrices whose length and width are both smaller than or equal to that size;
S32, for any candidate word-forming matrix, if the sum of its elements is greater than or equal to a preset first threshold and the sum of its elements excluding those on the diagonal is greater than or equal to a second threshold, determine the candidate as a word-forming matrix satisfying the preset conditions.
Optionally, the calculation module is configured to:
S41, select any two word-forming matrices from the plurality of word-forming matrices, denoted A and B respectively, wherein the upper-left element of word-forming matrix A is (a0, a0), the lower-right element of A is (as, as), the upper-left element of word-forming matrix B is (b0, b0), and the lower-right element of B is (bs, bs); set a=a0, b=b0, z=0;
S42, calculate z = att_matrix[a][b] + z;
S43, judge whether a is smaller than as; if yes, set a=a+1 and return to S42; if not, execute S44;
S44, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S42; if not, execute S45;
S45, set a=a0, b=b0, f=0;
S46, calculate f = att_matrix[b][a] + f;
S47, judge whether a is smaller than as; if yes, set a=a+1 and return to S46; if not, execute S48;
S48, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S46; if not, execute S49;
S49, calculate the average value z' of z and the average value f' of f, and take the larger of z' and f' as the collocation degree score of word-forming matrices A and B; if the score is greater than or equal to a preset threshold, determine that A and B satisfy the collocation relation; if it is smaller than the preset threshold, determine that A and B do not satisfy the collocation relation.
In another aspect, an electronic device is provided. The electronic device comprises a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the above BERT-model-based word collocation extraction method.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the at least one instruction being loaded and executed by a processor to implement the above BERT-model-based word collocation extraction method.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
In the embodiments of the invention, training samples are acquired and an initial BERT model is trained on them to obtain a trained BERT model; an input sentence from which information is to be extracted is acquired and fed into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model; a plurality of word-forming matrices satisfying preset conditions are determined within the attention matrix; collocation degree scores between the word-forming matrices are calculated according to the attention matrix, and the collocation relations between the word-forming matrices are determined from the calculated scores. The invention requires no special expert knowledge and no sample labeling to train a dedicated word collocation extraction model: using the internal attention matrix of an existing BERT model, words are first extracted from the correlations between characters, the collocation degree between the words is then determined from the correlations between the words, and the word collocation relations are extracted accordingly. This saves both labor cost and time cost, allows the degree of word collocation to be determined with high efficiency and high quality using an existing BERT model, and improves the efficiency of word collocation extraction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a BERT-model-based word collocation extraction method provided by an embodiment of the present invention;
FIG. 2 is a block diagram of a BERT-model-based word collocation extraction apparatus provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a BERT-model-based word collocation extraction device provided by an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantages clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the invention provides a word collocation extraction method based on a BERT model, which can be implemented by an electronic device; the electronic device may be a terminal or a server. As shown in the flowchart of fig. 1, the processing flow of the method may include the following steps:
S1, acquire training samples, and train an initial BERT model on the training samples to obtain a trained BERT model.
In one possible implementation, during the data preparation stage a corpus may first be collected, for example by crawling the internet.
Optionally, S1 may comprise the following steps S11-S13:
S11, collect a corpus composed of articles;
S12, segment the articles in the corpus into sentences to obtain a plurality of training sample sentences;
S13, train the initial BERT model on the training sample sentences to obtain the trained BERT model.
In one possible implementation, the initial BERT model may be trained in any manner commonly used in the prior art, which is not described in detail here.
S2, acquire an input sentence from which information is to be extracted, and feed the input sentence into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model.
In one possible implementation, the overall structure of the BERT model can be divided into an input representation (Input Embeddings), a multi-head attention mechanism (Multi-Head Attention), a residual connection (Add) with LayerNorm, a feed-forward network (Feed Forward), another residual connection (Add) with LayerNorm, and the output.
Specifically, the attention matrix is a matrix of dimension N×N, where N is the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
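For orientation, such an uppermost-layer attention matrix can be obtained from an off-the-shelf BERT implementation. The following minimal Python sketch is an illustration, not the patent's reference code: the HuggingFace transformers library, the bert-base-chinese checkpoint, and the averaging of the attention heads into a single N×N matrix are all assumptions, since the embodiment does not specify how the per-head attention weights are combined.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_attentions=True)
model.eval()

def top_layer_attention(sentence: str) -> torch.Tensor:
    # Return an N x N attention matrix over the sentence's characters.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions holds one (batch, heads, seq, seq) tensor per layer.
    att = outputs.attentions[-1][0]  # uppermost layer, first batch item
    att = att.mean(dim=0)            # average over the heads (an assumption)
    return att[1:-1, 1:-1]           # drop the [CLS] and [SEP] positions

att = top_layer_attention("提高下次成绩")
print(att.shape)  # torch.Size([6, 6]), one row per character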
For example, the sentence "提高下次成绩" ("improve next performance") may be represented by the attention matrix in Table 1:
TABLE 1 (rows: attending character; columns: attended character)

      提      高      下      次      成      绩
提    0.506   0.2     0.06    0.017   0.117   0.1
高    0.2     0.47    0.023   0.023   0.121   0.16
下    0.05    0.05    0.7     0.1     0.05    0.05
次    0.06    0.06    0.1     0.68    0.05    0.05
成    0.1     0.1     0.04    0.04    0.37    0.35
绩    0.1     0.1     0.05    0.04    0.3     0.41
S3, determine, within the attention matrix, a plurality of word-forming matrices satisfying the preset conditions.
Optionally, determining the word-forming matrices in S3 may comprise the following steps S31-S32:
S31, traverse the attention matrix according to a set maximum word-forming matrix size, and determine candidate word-forming matrices whose length and width are both smaller than or equal to that size.
S32, for any candidate word-forming matrix, if the sum of its elements is greater than or equal to a preset first threshold and the sum of its elements excluding those on the diagonal is greater than or equal to a second threshold, determine the candidate as a word-forming matrix satisfying the preset conditions.
In one possible implementation, the attention matrix is searched for internal matrices that satisfy the word-forming condition, as the following simple pseudocode explains:
for b = 0; b < N:
    set e = N-1 at the start of each search;
    for the current b, search for the smallest internal matrix satisfying the word-forming condition, with upper-left corner (b, b) and lower-right corner (e, e);
    the word-forming condition is that, for each row q of the internal matrix (b <= q <= e):
        1) the sum of att_matrix[q][p] over b <= p <= e is greater than or equal to the first threshold T1; and
        2) the sum of att_matrix[q][h] over b <= h <= e with h != q is greater than or equal to the second threshold T2;
    if a candidate word-forming matrix satisfying the condition is found, put it into the set and start the next search from e+1, i.e. b = e+1;
    if no matrix satisfying the condition is found, set e = e-1, keep b unchanged, and search again.
In the example of Table 1, with the first threshold T1 = 0.65 and the second threshold T2 = 0.4, the word-forming matrices satisfying the word-forming condition may be:
the word-forming matrix corresponding to "提高" ("improve"), with upper-left corner (1, 1) and lower-right corner (2, 2);
the word-forming matrix corresponding to "成绩" ("performance"), with upper-left corner (5, 5) and lower-right corner (6, 6).
The search can also be written out in runnable form. The following minimal Python sketch is an illustration rather than the patent's reference code: it reads the word-forming condition at the level of the whole candidate matrix (total element sum at least T1 and off-diagonal sum at least T2, which matches step S32 and the Table 1 example above), takes the maximum word length to be 2, and uses the example thresholds T1=0.65 and T2=0.4.
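import numpy as np

# Table 1 as an array; rows/columns follow the character order 提高下次成绩.
att_matrix = np.array([
    [0.506, 0.2,   0.06,  0.017, 0.117, 0.1  ],
    [0.2,   0.47,  0.023, 0.023, 0.121, 0.16 ],
    [0.05,  0.05,  0.7,   0.1,   0.05,  0.05 ],
    [0.06,  0.06,  0.1,   0.68,  0.05,  0.05 ],
    [0.1,   0.1,   0.04,  0.04,  0.37,  0.35 ],
    [0.1,   0.1,   0.05,  0.04,  0.3,   0.41 ],
])

def is_word(att, b, e, t1=0.65, t2=0.4):
    # Word-forming condition, read at whole-matrix level (an assumption):
    # total element sum >= T1 and off-diagonal element sum >= T2.
    sub = att[b:e + 1, b:e + 1]
    return sub.sum() >= t1 and (sub.sum() - np.trace(sub)) >= t2

def find_word_matrices(att, max_len=2, t1=0.65, t2=0.4):
    # Scan left to right for non-overlapping word-forming matrices whose
    # side length does not exceed the set maximum size (step S31).
    n, words, b = len(att), [], 0
    while b < n:
        for e in range(b + 1, min(b + max_len, n)):  # at least two characters
            if is_word(att, b, e, t1, t2):
                words.append((b, e))
                b = e  # the next search starts from e + 1 (b += 1 below)
                break
        b += 1
    return words

print(find_word_matrices(att_matrix))  # [(0, 1), (4, 5)], i.e. 提高 and 成绩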
S4, calculate the collocation degree scores between the word-forming matrices according to the attention matrix, and determine the collocation relations between the word-forming matrices according to the calculated scores.
In one possible implementation, after the plurality of word-forming matrices have been found, the following collocation judgment is performed:
judge, pairwise, whether the word-forming matrices satisfy the collocation relation.
For each pair of word-forming matrices A and B, collocation degree scores are calculated in the two directions A->B and B->A.
The collocation degree of A->B is calculated as follows:
for each row index a_index of A and each row index b_index of B,
sum att_matrix[a_index][b_index];
then average the sum to obtain match_score(A->B);
match_score(B->A) is obtained in the same way.
The final collocation degree score of matrices A and B is:
match_score(A, B) = max(match_score(A->B), match_score(B->A))
If match_score(A, B) is greater than or equal to a third threshold, for example 0.1, the pair is judged to satisfy the collocation relation.
In the example of Table 1, the word-forming matrices of "提高" ("improve") and "成绩" ("performance") satisfy the collocation relation.
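Continuing the sketch above, the collocation judgment can be illustrated in a few lines of Python; this is again an illustration rather than the patent's reference code, applied to the two word-forming matrices found in Table 1 with the example third threshold of 0.1.

import numpy as np

att_matrix = np.array([
    [0.506, 0.2,   0.06,  0.017, 0.117, 0.1  ],
    [0.2,   0.47,  0.023, 0.023, 0.121, 0.16 ],
    [0.05,  0.05,  0.7,   0.1,   0.05,  0.05 ],
    [0.06,  0.06,  0.1,   0.68,  0.05,  0.05 ],
    [0.1,   0.1,   0.04,  0.04,  0.37,  0.35 ],
    [0.1,   0.1,   0.05,  0.04,  0.3,   0.41 ],
])

def match_score(att, span_a, span_b):
    # Collocation degree score: the larger of the mean A->B block and
    # the mean B->A block of the attention matrix.
    (a0, a1), (b0, b1) = span_a, span_b
    a_to_b = att[a0:a1 + 1, b0:b1 + 1].mean()  # rows of A attending to B
    b_to_a = att[b0:b1 + 1, a0:a1 + 1].mean()  # rows of B attending to A
    return max(a_to_b, b_to_a)

A, B = (0, 1), (4, 5)  # 提高 and 成绩 from the search step
score = match_score(att_matrix, A, B)
print(round(float(score), 4))  # 0.1245
print(score >= 0.1)            # True: the collocation relation holds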
Specifically, the operation of S4 may comprise the following steps S41-S49:
S41, select any two word-forming matrices from the plurality of word-forming matrices, denoted A and B respectively, wherein the upper-left element of word-forming matrix A is (a0, a0), the lower-right element of A is (as, as), the upper-left element of word-forming matrix B is (b0, b0), and the lower-right element of B is (bs, bs); set a=a0, b=b0, z=0;
S42, calculate z = att_matrix[a][b] + z;
S43, judge whether a is smaller than as; if yes, set a=a+1 and return to S42; if not, execute S44;
S44, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S42; if not, execute S45;
S45, set a=a0, b=b0, f=0;
S46, calculate f = att_matrix[b][a] + f;
S47, judge whether a is smaller than as; if yes, set a=a+1 and return to S46; if not, execute S48;
S48, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S46; if not, execute S49;
S49, calculate the average value z' of z and the average value f' of f, and take the larger of z' and f' as the collocation degree score of word-forming matrices A and B; if the score is greater than or equal to a preset threshold, determine that A and B satisfy the collocation relation; if it is smaller than the preset threshold, determine that A and B do not satisfy the collocation relation.
In the embodiments of the invention, training samples are acquired and an initial BERT model is trained on them to obtain a trained BERT model; an input sentence from which information is to be extracted is acquired and fed into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model; a plurality of word-forming matrices satisfying preset conditions are determined within the attention matrix; collocation degree scores between the word-forming matrices are calculated according to the attention matrix, and the collocation relations between the word-forming matrices are determined from the calculated scores. The invention requires no special expert knowledge and no sample labeling to train a dedicated word collocation extraction model: using the internal attention matrix of an existing BERT model, words are first extracted from the correlations between characters, the collocation degree between the words is then determined from the correlations between the words, and the word collocation relations are extracted accordingly. This saves both labor cost and time cost, allows the degree of word collocation to be determined with high efficiency and high quality using an existing BERT model, and improves the efficiency of word collocation extraction.
Fig. 2 is a block diagram of a word collocation extraction apparatus based on a BERT model, for use in the BERT-model-based word collocation extraction method, according to an exemplary embodiment. Referring to fig. 2, the apparatus comprises a training module 210, an acquisition module 220, a determining module 230, and a calculation module 240, wherein:
the training module 210 is configured to acquire training samples and train an initial BERT model on them to obtain a trained BERT model;
the acquisition module 220 is configured to acquire an input sentence from which information is to be extracted and feed it into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model;
the determining module 230 is configured to determine, within the attention matrix, a plurality of word-forming matrices satisfying preset conditions;
the calculation module 240 is configured to calculate the collocation degree scores between the word-forming matrices according to the attention matrix and determine the collocation relations between the word-forming matrices according to the calculated scores.
Optionally, the training module 210 is configured to:
collect a corpus composed of articles;
segment the articles in the corpus into sentences to obtain a plurality of training sample sentences;
and train the initial BERT model on the training sample sentences to obtain the trained BERT model.
Optionally, the attention matrix is a matrix of dimension N×N, where N is the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
Optionally, the determining module 230 is configured to:
S31, traverse the attention matrix according to a set maximum word-forming matrix size, and determine candidate word-forming matrices whose length and width are both smaller than or equal to that size;
S32, for any candidate word-forming matrix, if the sum of its elements is greater than or equal to a preset first threshold and the sum of its elements excluding those on the diagonal is greater than or equal to a second threshold, determine the candidate as a word-forming matrix satisfying the preset conditions.
Optionally, the calculation module 240 is configured to:
S41, select any two word-forming matrices from the plurality of word-forming matrices, denoted A and B respectively, wherein the upper-left element of word-forming matrix A is (a0, a0), the lower-right element of A is (as, as), the upper-left element of word-forming matrix B is (b0, b0), and the lower-right element of B is (bs, bs); set a=a0, b=b0, z=0;
S42, calculate z = att_matrix[a][b] + z;
S43, judge whether a is smaller than as; if yes, set a=a+1 and return to S42; if not, execute S44;
S44, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S42; if not, execute S45;
S45, set a=a0, b=b0, f=0;
S46, calculate f = att_matrix[b][a] + f;
S47, judge whether a is smaller than as; if yes, set a=a+1 and return to S46; if not, execute S48;
S48, judge whether b is smaller than bs; if yes, set a=a0 and b=b+1 and return to S46; if not, execute S49;
S49, calculate the average value z' of z and the average value f' of f, and take the larger of z' and f' as the collocation degree score of word-forming matrices A and B; if the score is greater than or equal to a preset threshold, determine that A and B satisfy the collocation relation; if it is smaller than the preset threshold, determine that A and B do not satisfy the collocation relation.
In the embodiments of the invention, training samples are acquired and an initial BERT model is trained on them to obtain a trained BERT model; an input sentence from which information is to be extracted is acquired and fed into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model; a plurality of word-forming matrices satisfying preset conditions are determined within the attention matrix; collocation degree scores between the word-forming matrices are calculated according to the attention matrix, and the collocation relations between the word-forming matrices are determined from the calculated scores. The invention requires no special expert knowledge and no sample labeling to train a dedicated word collocation extraction model: using the internal attention matrix of an existing BERT model, words are first extracted from the correlations between characters, the collocation degree between the words is then determined from the correlations between the words, and the word collocation relations are extracted accordingly. This saves both labor cost and time cost, allows the degree of word collocation to be determined with high efficiency and high quality using an existing BERT model, and improves the efficiency of word collocation extraction.
Fig. 3 is a schematic structural diagram of a BERT-model-based word collocation extraction device according to an embodiment of the present invention. As shown in fig. 3, the device may include the word collocation extraction apparatus shown in fig. 2. Optionally, the word collocation extraction device 310 may include a processor 2001.
Optionally, the word collocation extraction device 310 may also include a memory 2002 and a transceiver 2003.
The processor 2001 may be connected to the memory 2002 and the transceiver 2003, for example via a communication bus.
The components of the word collocation extraction device 310 are described in detail below with reference to fig. 3:
The processor 2001 is the control center of the word collocation extraction device 310 and may be a single processor or the collective name of a plurality of processing elements. For example, the processor 2001 may be one or more central processing units (CPUs), one or more application-specific integrated circuits (ASICs), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSPs) or one or more field-programmable gate arrays (FPGAs).
Alternatively, the processor 2001 may perform the various functions of the word collocation extraction device 310 by running or executing a software program stored in the memory 2002 and invoking data stored in the memory 2002.
In a specific implementation, as an embodiment, the processor 2001 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 3.
In a specific implementation, as an embodiment, the word collocation extraction device 310 may also include multiple processors, such as the processor 2001 and the processor 2003 shown in fig. 3. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 2002 is used to store the software program executing the solution of the present invention, under the control of the processor 2001; for the specific implementation, reference may be made to the above method embodiment, which is not repeated here.
Alternatively, the memory 2002 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 2002 may be integrated with the processor 2001 or may exist separately and be coupled to the processor 2001 via an interface circuit (not shown in fig. 3) of the word collocation extraction device 310; the embodiments of the invention are not specifically limited in this respect.
The transceiver 2003 is used for communication with a network device or with a terminal device.
Alternatively, the transceiver 2003 may include a receiver and a transmitter (not separately shown in fig. 3). The receiver implements the receiving function and the transmitter implements the transmitting function.
Alternatively, the transceiver 2003 may be integrated with the processor 2001 or may exist separately and be coupled to the processor 2001 through an interface circuit (not shown in fig. 3) of the word collocation extraction device 310; the embodiments of the invention are not specifically limited in this respect.
It should be noted that the structure of the word collocation extraction device 310 shown in fig. 3 does not constitute a limitation; an actual device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In addition, for the technical effects of the word collocation extraction device 310, reference may be made to the technical effects of the word collocation extraction method described in the above method embodiments, which are not repeated here.
It should be appreciated that the processor 2001 in embodiments of the invention may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
It should also be appreciated that the memory in embodiments of the invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above embodiments may take the form of a computer program product in whole or in part. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship, as can be understood from the context.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "At least one of" the following items or similar expressions means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be single or plural.
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for extracting word collocations based on a BERT model, the method comprising:
S1, acquiring training samples, and training an initial BERT model on the training samples to obtain a trained BERT model;
S2, acquiring an input sentence from which information is to be extracted, and feeding the input sentence into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model;
S3, determining, within the attention matrix, a plurality of word-forming matrices that satisfy preset conditions;
and S4, calculating collocation degree scores between the word-forming matrices according to the attention matrix, and determining the collocation relations between the word-forming matrices according to the calculated scores.
2. The method of claim 1, wherein acquiring training samples and training the initial BERT model on them in S1 comprises:
collecting a corpus composed of articles;
segmenting the articles in the corpus into sentences to obtain a plurality of training sample sentences;
and training the initial BERT model on the training sample sentences to obtain the trained BERT model.
3. The method of claim 1, wherein the attention matrix is a matrix of dimension N×N, N being the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
4. The method of claim 3, wherein determining, within the attention matrix, the plurality of word-forming matrices satisfying the preset conditions in S3 comprises:
S31, traversing the attention matrix according to a set maximum word-forming matrix size, and determining candidate word-forming matrices whose length and width are both smaller than or equal to that size;
S32, for any candidate word-forming matrix, if the sum of its elements is greater than or equal to a preset first threshold and the sum of its elements excluding those on the diagonal is greater than or equal to a second threshold, determining the candidate as a word-forming matrix satisfying the preset conditions.
5. The method of claim 4, wherein calculating the collocation degree scores between the word-forming matrices according to the attention matrix in S4, and determining the collocation relations between the word-forming matrices according to the calculated scores, comprises:
S41, selecting any two word-forming matrices from the plurality of word-forming matrices, denoted A and B respectively, wherein the upper-left element of word-forming matrix A is (a0, a0), the lower-right element of A is (as, as), the upper-left element of word-forming matrix B is (b0, b0), and the lower-right element of B is (bs, bs); setting a=a0, b=b0, z=0;
S42, calculating z = att_matrix[a][b] + z;
S43, judging whether a is smaller than as; if yes, setting a=a+1 and returning to S42; if not, executing S44;
S44, judging whether b is smaller than bs; if yes, setting a=a0 and b=b+1 and returning to S42; if not, executing S45;
S45, setting a=a0, b=b0, f=0;
S46, calculating f = att_matrix[b][a] + f;
S47, judging whether a is smaller than as; if yes, setting a=a+1 and returning to S46; if not, executing S48;
S48, judging whether b is smaller than bs; if yes, setting a=a0 and b=b+1 and returning to S46; if not, executing S49;
S49, calculating the average value z' of z and the average value f' of f, and taking the larger of z' and f' as the collocation degree score of word-forming matrices A and B; if the score is greater than or equal to a preset threshold, determining that A and B satisfy the collocation relation; if it is smaller than the preset threshold, determining that A and B do not satisfy the collocation relation.
6. A word collocation extraction apparatus based on a BERT model, wherein the apparatus is used for a BERT-model-based word collocation extraction method, the apparatus comprising:
a training module, configured to acquire training samples and train an initial BERT model on them to obtain a trained BERT model;
an acquisition module, configured to acquire an input sentence from which information is to be extracted and feed it into the trained BERT model to obtain the attention matrix of the uppermost layer of the BERT model;
a determining module, configured to determine, within the attention matrix, a plurality of word-forming matrices satisfying preset conditions;
and a calculation module, configured to calculate the collocation degree scores between the word-forming matrices according to the attention matrix and determine the collocation relations between the word-forming matrices according to the calculated scores.
7. The apparatus of claim 6, wherein the training module is configured to:
collect a corpus composed of articles;
segment the articles in the corpus into sentences to obtain a plurality of training sample sentences;
and train the initial BERT model on the training sample sentences to obtain the trained BERT model.
8. The apparatus of claim 6, wherein the attention matrix is a matrix of dimension N×N, N being the number of characters (tokens) contained in the input sentence;
element (i, j) of the attention matrix represents the strength of the semantic correlation of the j-th character with the i-th character, and element (i, i) on the diagonal represents the autocorrelation strength of each character.
9. A word collocation extraction device based on a BERT model, the device comprising:
a processor;
and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 5.
10. A computer-readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 5.
CN202410019481.4A (priority and filing date 2024-01-05) Term collocation extraction method and device based on BERT model. Status: Active. Granted as CN117973378B.

Priority Applications (1)

Application Number: CN202410019481.4A | Priority Date: 2024-01-05 | Filing Date: 2024-01-05 | Title: Term collocation extraction method and device based on BERT model

Applications Claiming Priority (1)

Application Number: CN202410019481.4A | Priority Date: 2024-01-05 | Filing Date: 2024-01-05 | Title: Term collocation extraction method and device based on BERT model

Publications (2)

Publication Number | Publication Date
CN117973378A | 2024-05-03
CN117973378B | 2024-07-09

Family

ID=90857358

Family Applications (1)

Application Number: CN202410019481.4A (Active, granted as CN117973378B) | Priority Date: 2024-01-05 | Filing Date: 2024-01-05 | Title: Term collocation extraction method and device based on BERT model

Country Status (1)

CN (1): CN117973378B

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051871A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Text extraction method, apparatus, and device, and storage medium
CN113569016A (en) * 2021-09-27 2021-10-29 北京语言大学 Bert model-based professional term extraction method and device
CN117217277A (en) * 2023-04-07 2023-12-12 腾讯科技(深圳)有限公司 Pre-training method, device, equipment, storage medium and product of language model

Also Published As

Publication number Publication date
CN117973378B (en) 2024-07-09


Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant