CN116502611B - Labeling method, labeling device, equipment and readable storage medium - Google Patents

Publication number
CN116502611B
Authority
CN
China
Prior art keywords
character string
string
substring
sub
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310771543.2A
Other languages
Chinese (zh)
Other versions
CN116502611A (en)
Inventor
范俊杰
李发成
张如高
虞正华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Magic Vision Intelligent Technology Co ltd
Original Assignee
Shenzhen Magic Vision Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Magic Vision Intelligent Technology Co ltd
Priority to CN202310771543.2A
Publication of CN116502611A
Application granted
Publication of CN116502611B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a labeling method, a labeling device, equipment and a readable storage medium. The method comprises: receiving a first character string and a second character string to be labeled; performing a first random out-of-order pair recognition on the substrings of the first character string segmented according to a target substring length and the substrings of the second character string segmented according to the target substring length, to obtain a corresponding recognition result; performing a loop recognition operation until the target substring length is greater than the length of the first character string or the second character string; and labeling the first character string and the second character string with the same semantic information when the finally obtained recognition result indicates that the substring of the first character string and the substring of the second character string form a random out-of-order pair. The labeling efficiency can thereby be improved.

Description

Labeling method, labeling device, equipment and readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a labeling method, a labeling device, equipment and a readable storage medium.
Background
When a model is trained in the artificial intelligence field, random changes and noise are often added to the training data to augment the data and improve the generalization capability of the model. For example, when a vehicle detection model is trained, images of the same vehicle are manually processed by rotation, cropping, stitching, and brightness and color adjustment to augment the training data. In this way, the trained model can truly learn the characteristics of the vehicle and is better able to distinguish genuine features from spurious ones.
Similarly, in natural language processing (NLP), the texts and sentences used for model training also need to be randomly perturbed to improve the generalization ability of the NLP model. This requires that, when labeling the texts and sentences used for model training, the device performing the labeling can automatically recognize the perturbed texts and sentences and label them correctly. For example, assuming that the NLP model needs to be trained to recognize text A, the training data may contain text A, text B obtained by randomly perturbing text A, and text C, which is neither text A nor text B. Text A and text B may be referred to as positive samples, and text C as a negative sample. The device performing the labeling needs to identify text A, text B and text C in the training data, label text A and text B with the semantic information a of text A, and label text C with semantic information other than semantic information a. Here, semantic information characterizes the meaning of a text or sentence.
Currently, these perturbed texts are usually identified by a brute-force recursive search algorithm. This identification method is extremely time-consuming, resulting in low labeling efficiency.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a labeling method, a labeling device, an electronic apparatus, and a computer-readable storage medium, which can improve labeling efficiency.
In one aspect, the present invention provides a labeling method, which includes:
receiving a first character string and a second character string to be marked;
performing a first random out-of-order pair recognition on the substrings of the first character string segmented according to a target substring length and the substrings of the second character string segmented according to the target substring length, to obtain a corresponding recognition result;
performing a loop recognition operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop recognition operation is: adding a specified length to the current target substring length to obtain a new target substring length; and, based on the recognition results already obtained, continuing to perform random out-of-order pair recognition on the substrings of the first character string segmented according to the new target substring length and the substrings of the second character string segmented according to the new target substring length, to obtain a corresponding recognition result;
and labeling the first character string and the second character string with the same semantic information when the finally obtained recognition result indicates that the substring of the first character string and the substring of the second character string form a random out-of-order pair.
In another aspect, the present invention also provides a labeling device, including:
the receiving module is used for receiving the first character string and the second character string to be marked;
the initial recognition module is used for performing a first random out-of-order pair recognition on the substrings of the first character string segmented according to a target substring length and the substrings of the second character string segmented according to the target substring length, to obtain a corresponding recognition result;
the loop recognition module is used for performing a loop recognition operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop recognition operation is: adding a specified length to the current target substring length to obtain a new target substring length; and, based on the recognition results already obtained, continuing to perform random out-of-order pair recognition on the substrings of the first character string segmented according to the new target substring length and the substrings of the second character string segmented according to the new target substring length, to obtain a corresponding recognition result;
and the labeling module is used for labeling the first character string and the second character string with the same semantic information when the finally obtained recognition result indicates that the substring of the first character string and the substring of the second character string form a random out-of-order pair.
In a further aspect the application provides an electronic device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements a method as described above.
In a further aspect the application provides a computer readable storage medium for storing a computer program which, when executed by a processor, implements a method as described above.
In the technical solutions of some embodiments of the present application, random out-of-order pairs are recognized in the first character string and the second character string to be labeled in order of increasing substring length, finally yielding a recognition result that indicates whether the first character string and the second character string form a random out-of-order pair, according to which the two character strings are labeled. Because the recognition is carried out bottom-up using a dynamic programming algorithm, the recognition time can be greatly shortened, and the labeling efficiency is improved accordingly.
Drawings
The features and advantages of the present application will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the application in any way, in which:
FIG. 1 is a flow chart of a labeling method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a segmentation of substrings according to an embodiment of the present application;
FIG. 3 is another schematic diagram of a segmentation of the substring of FIG. 2;
FIG. 4 shows a block diagram of an annotation device according to one embodiment of the application;
FIG. 5 shows a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, based on the embodiments of the application, which a person skilled in the art would obtain without making any inventive effort, are within the scope of the application.
In the present application, a text or sentence used for NLP model training is referred to as a character string. A character string consists of minimum constituent units. A minimum constituent unit is the smallest unit into which a character string may be split. One or more consecutive minimum constituent units constitute a substring. Take the string "I like playing video games" as an example. The words in the string may be the minimum constituent units, and one or more consecutive words may constitute a substring. For example, "I" and "I like" are each a substring. Take the string "xyz" as another example. The characters in the string may be the minimum constituent units, and one or more consecutive characters may constitute a substring. For example, "x" and "xy" are each a substring.
A character string also has a length. In some embodiments of the present application, the length of a character string is the number of minimum constituent units it includes. For example, the character string "I like playing video games" includes 5 minimum constituent units (5 words), so its length is 5. For another example, the character string "xyz" includes 3 minimum constituent units (3 characters), so its length is 3.
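The definitions above can be illustrated with a minimal Python sketch. The function names and the `by_word` flag are hypothetical and are introduced only for illustration; the application does not prescribe how a string is divided into minimum constituent units.

```python
def units(s, by_word=True):
    """Split a string into minimum constituent units (words or characters)."""
    return s.split() if by_word else list(s)


def length(s, by_word=True):
    """Length of a string = number of minimum constituent units it contains."""
    return len(units(s, by_word))


def substring(s, start, k, by_word=True):
    """The k consecutive units of s beginning at unit index `start` (0-based)."""
    return units(s, by_word)[start:start + k]
```

For instance, `length("I like playing video games")` counts words, while `length("xyz", by_word=False)` counts characters.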
At present, one method of augmenting the training data of an NLP model is to scramble a character string s to be recognized as follows:
11) The character string s is split, at a position between two adjacent minimum constituent units, into a left substring x and a right substring y, i.e., the original text s = x + y. Here, "+" indicates that y is spliced to the right of x.
12) Whether to exchange the order of x and y is decided at random, i.e., after this step the resulting string is either x + y or y + x.
13) Steps 11) to 13) are then applied in turn to each of the substrings x and y, until a substring includes no more than 1 minimum constituent unit.
Assume that the character string t is obtained by scrambling the character string s as described above. The character string s and the character string t are then referred to as a random out-of-order pair. In short, a random out-of-order pair is a pair of character strings in which one string can be obtained from the other by randomly exchanging the left and right order of its substrings.
It should be noted that if the substrings are never exchanged in any execution of step 12), the resulting string t is identical to s, i.e., every character string forms a random out-of-order pair with itself. For example, "video games" includes two words, and "video games" is obtained when the left and right order of the two words is not exchanged, i.e., "video games" and "video games" form a random out-of-order pair. For another example, "play" includes only one word, so no exchange of order between different words is involved, i.e., "play" and "play" also form a random out-of-order pair.
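The scrambling procedure of steps 11) to 13) can be sketched as follows. This is a minimal illustration only: the function name, the uniform choice of split position, and the 50% exchange probability are assumptions, since the text says only that the exchange is decided at random.

```python
import random


def scramble(units, rng=random):
    """Scramble a list of minimum constituent units per steps 11) to 13)."""
    if len(units) <= 1:                      # stop condition of step 13)
        return list(units)
    cut = rng.randint(1, len(units) - 1)     # step 11): split between two units
    x, y = units[:cut], units[cut:]
    swap = rng.random() < 0.5                # step 12): assumed 50% probability
    x, y = scramble(x, rng), scramble(y, rng)  # step 13): recurse on x and y
    return y + x if swap else x + y
```

Because the units are only reordered, the output always has the same length and the same units as the input, which is the property the pre-recognition described later relies on.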
When the training data of an NLP model is labeled, the character string t obtained by scrambling the character string s to be recognized needs to be labeled with the semantic information of s. In some techniques, the character strings that form random out-of-order pairs with the character string s are identified by a brute-force recursive search algorithm and are then labeled with the semantic information of s. This identification method is time-consuming and reduces labeling efficiency.
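The application does not give the brute-force algorithm itself. As an assumption, a plain recursive search of the following kind is one plausible form of it; the function name is hypothetical. Without memoization, the same substring pairs are re-examined many times, which is why such a search becomes slow as the strings grow.

```python
def is_pair_bruteforce(s, t):
    """Recursively test whether unit lists s and t form a random out-of-order pair."""
    if len(s) != len(t):
        return False
    if s == t:                 # a string always pairs with itself
        return True
    n = len(s)
    for cut in range(1, n):
        # Case 1: the two parts were not exchanged at this split.
        if is_pair_bruteforce(s[:cut], t[:cut]) and \
           is_pair_bruteforce(s[cut:], t[cut:]):
            return True
        # Case 2: the two parts were exchanged at this split.
        if is_pair_bruteforce(s[:cut], t[n - cut:]) and \
           is_pair_bruteforce(s[cut:], t[:n - cut]):
            return True
    return False
```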
In view of the above, the present application provides a labeling method, which can improve labeling efficiency. The labeling method can be applied to the electronic equipment. Electronic devices include, but are not limited to, desktop computers, notebook computers, tablet computers, servers, and the like. Referring to fig. 1, a flowchart of a labeling method according to an embodiment of the application is shown. In fig. 1, the labeling method includes the following steps:
step S11, a first character string and a second character string to be marked are received.
Specifically, one of the first character string and the second character string is a character string to be recognized in model training. For example, if the NLP model is to be trained to recognize the character string "I like playing video games", then one of the two character strings is "I like playing video games", and the other may be either a character string that forms a random out-of-order pair with "I like playing video games" or a character string that does not. When the first character string and the second character string form a random out-of-order pair, both may be labeled with the semantic information of the character string to be recognized; when they do not, the two character strings may be labeled with different semantic information. This facilitates recognition by the model.
The following describes how to identify whether the first character string and the second character string form a random out-of-order pair.
From the scrambling process of steps 11) to 13) above, it follows that, first, the two strings of a random out-of-order pair must be of equal length and, second, the two strings must include the same minimum constituent units, each occurring the same number of times (only in a different order). For example, "I like playing playing video games" and "playing video games I like playing" form a random out-of-order pair: both strings have length 6 and each includes 1 "I", 1 "like", 1 "video", 1 "games" and 2 "playing", i.e., the two strings have the same length, include the same minimum constituent units, and each minimum constituent unit occurs the same number of times. However, two character strings that have the same length, include the same minimum constituent units, and have the same number of occurrences of each minimum constituent unit do not necessarily form a random out-of-order pair. For example, assume that the first character string is "I like playing video games" and the second character string is "like video I playing games". The two character strings have equal lengths, include the same minimum constituent units, and each minimum constituent unit occurs the same number of times, but neither string can be obtained from the other by randomly exchanging the left and right order of its substrings. Therefore, equal length, identical minimum constituent units, and equal numbers of occurrences of each minimum constituent unit are necessary but not sufficient conditions for two character strings to form a random out-of-order pair.
In view of this, in some embodiments, random out-of-order pair recognition may include two phases: pre-recognition and subsequent recognition. The pre-recognition performs a preliminary check of the first character string and the second character string based on the string lengths and the minimum constituent units the strings include. Only after the pre-recognition is passed are steps S12 to S14 performed for the subsequent recognition. In this way, the amount of data processed in the subsequent recognition can be reduced.
Specifically, the pre-recognition may include:
comparing the lengths of the first character string and the second character string;
dividing the same minimum constituent unit in the first character string into one class and dividing the same minimum constituent unit in the second character string into one class in the case that the first character string and the second character string have the same length;
when the first character string and the second character string have the same minimum constituent unit category and the minimum constituent unit number in each category is the same, the random out-of-order pair recognition in steps S12 to S14 is performed.
For ease of understanding, the pre-recognition process is described below by way of example. Let the first character string be "I like playing video games". For different second character strings, the pre-recognition proceeds as follows:
In the case where the second character string is "I like playing games", since the lengths of the first character string and the second character string are not the same, it can be determined that the pre-recognition process of the first character string and the second character string is not passed, and it is unnecessary to continue to perform steps S12 to S14.
In the case where the second character string is "I like playing football games", the lengths of the first character string and the second character string are the same, but the first character string includes 1 "I", 1 "like", 1 "playing", 1 "video" and 1 "games", whereas the second character string includes 1 "I", 1 "like", 1 "playing", 1 "football" and 1 "games". The minimum constituent unit categories included in the two character strings are clearly not identical, so it can be determined that the first character string and the second character string do not pass the pre-recognition, and it is unnecessary to continue with steps S12 to S14.
In the case where the second character string is "video games I like playing", the first character string and the second character string are identical in length and each includes 1 "I", 1 "like", 1 "playing", 1 "video" and 1 "games", so it can be determined that the first character string and the second character string pass the pre-recognition, and steps S12 to S14 can then be performed.
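The pre-recognition described above amounts to comparing lengths and then comparing unit counts. A minimal sketch, assuming words are the minimum constituent units; the function name is hypothetical, and `collections.Counter` is used here as a convenient way to group identical units into categories:

```python
from collections import Counter


def pre_recognize(first, second):
    """Return True if the two strings pass pre-recognition."""
    a, b = first.split(), second.split()
    if len(a) != len(b):             # lengths must match
        return False
    return Counter(a) == Counter(b)  # same categories, same count per category
```

Applied to the three examples above with the first string "I like playing video games", the second strings "I like playing games" and "I like playing football games" are rejected, and "video games I like playing" passes.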
Step S12, performing a first random out-of-order pair recognition on the substrings of the first character string segmented according to a target substring length and the substrings of the second character string segmented according to the target substring length, to obtain a corresponding recognition result.
Specifically, the substring length may refer to the number of minimum constituent units included in the substring. For example, the substring split from the character string "I like playing video games" by the length 2 may be "I like", "like playing", "video games", or the like.
In this embodiment, the target substring length is denoted by k, and the initial value of k is 1. In the first random out-of-order pair recognition, the substrings of the first character string are compared with the substrings of the second character string; a pair of identical substrings is recognized as a random out-of-order pair, and a pair of different substrings is recognized as not being a random out-of-order pair. In short, in the first random out-of-order pair recognition, each minimum constituent unit of the first character string is compared for equality with each minimum constituent unit of the second character string. If two minimum constituent units are identical, they are recognized as a random out-of-order pair; otherwise, they are recognized as not being a random out-of-order pair.
Specifically, the recognition results may be stored in an array dp[i][j][k]. dp[i][j][k] represents the recognition result obtained by performing random out-of-order pair recognition, counting from left to right, on the substring of length k segmented from the first string starting at its i-th minimum constituent unit and the substring of length k segmented from the second string starting at its j-th minimum constituent unit. In computer programming, sequence numbers start at 0, so i ranges from 0 to m-1 and j ranges from 0 to n-1, where m represents the length of the first string and n represents the length of the second string. In the first random out-of-order pair recognition, the value of k is 1. For example, assuming that the first string is "I like playing video games" and the second string is "video games playing I like", the first random out-of-order pair recognition yields the following recognition results:
dp[0][0][1]=0,dp[0][1][1]=0,dp[0][2][1]=0,dp[0][3][1]=1,
……
dp[1][0][1]=0,dp[1][1][1]=0,dp[1][2][1]=0,dp[1][3][1]=0,
……
dp[4][4][1]=0
In the above example, 0 indicates that the two substrings are not a random out-of-order pair, and 1 indicates that they are a random out-of-order pair. For instance:
dp[0][0][1]=0 means that random out-of-order pair recognition was performed on the substring of length 1 segmented from the first string starting at its 0th minimum constituent unit (i.e., "I") and the substring of length 1 segmented from the second string starting at its 0th minimum constituent unit (i.e., "video"), and the recognition result is that the two substrings ("I" and "video") are not a random out-of-order pair;
dp[0][3][1]=1 means that random out-of-order pair recognition was performed on the substring of length 1 segmented from the first string starting at its 0th minimum constituent unit (i.e., "I") and the substring of length 1 segmented from the second string starting at its 3rd minimum constituent unit (i.e., "I"), and the recognition result is that the two substrings ("I" and "I") are a random out-of-order pair.
In summary, the first random out-of-order pair identification is completed. After the first random out-of-order pair identification is performed, the corresponding identification results may be saved so that these identification results may be used when the subsequent random out-of-order pair identification is performed in step S13. In some embodiments of the present application, the corresponding recognition results may be respectively stored for each random out-of-order pair recognition.
In the recognition result of step S12, if dp[i][j][1] is 1 whenever i is equal to j, that is, dp[0][0][1], dp[1][1][1], dp[2][2][1], …… are all 1, the first character string and the second character string are the same character string. Based on the above description of steps 11) to 13), identical character strings form a random out-of-order pair. In this case, the first character string and the second character string may be directly labeled with the same semantic information without executing the subsequent steps S13 and S14. If dp[i][j][1] is not 1 for every i equal to j, the first character string and the second character string are not identical, but they may still form a random out-of-order pair, so steps S13 and S14 are executed to continue the judgment.
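The first recognition pass (k = 1) and the identical-string shortcut can be sketched as follows. The function names are hypothetical, and the nested-list layout for dp[i][j][k] is only one possible storage choice; index k = 0 is left unused so that the indices match the notation in the text.

```python
def first_pass(a, b):
    """Fill dp[i][j][1] for unit lists a (first string) and b (second string)."""
    m, n = len(a), len(b)
    # dp[i][j][k]; only k = 1 is filled here, k = 0 is an unused placeholder.
    dp = [[[0, 0] for _ in range(n)] for _ in range(m)]
    for i in range(m):
        for j in range(n):
            dp[i][j][1] = 1 if a[i] == b[j] else 0
    return dp


def strings_identical(dp):
    """True if dp[i][i][1] == 1 for every i, i.e. the two strings are the same."""
    m = len(dp)
    return m > 0 and len(dp[0]) == m and all(dp[i][i][1] == 1 for i in range(m))
```

With the first string "I like playing video games" and the second string "video games playing I like", `first_pass` reproduces the values listed above, for example dp[0][0][1] = 0 and dp[0][3][1] = 1.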
Step S13, performing a loop recognition operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop recognition operation is: adding a specified length to the current target substring length to obtain a new target substring length; and, based on the recognition results already obtained, continuing to perform random out-of-order pair recognition on the substrings of the first character string segmented according to the new target substring length and the substrings of the second character string segmented according to the new target substring length, to obtain a corresponding recognition result.
In the present embodiment, the specified length is 1. That is, in the second random out-of-order pair recognition the value of k becomes 2, in the third it becomes 3, and so on. Meanwhile, as the number of recognition passes increases, the maximum values of i and j decrease accordingly. Take the second random out-of-order pair recognition as an example. Since the length of the segmented substrings becomes 2, the maximum values of i and j must each be reduced by 1 relative to the first recognition, so that a substring of length 2 can always be segmented. That is, i ranges from 0 to m-2 and j ranges from 0 to n-2, where m represents the length of the first string and n represents the length of the second string. In short, once the substring length has become 2, a substring of length 2 can no longer be segmented starting from the (m-1)-th minimum constituent unit (i.e., the last minimum constituent unit) of the first string, nor from the (n-1)-th minimum constituent unit of the second string, so the maximum values of i and j must be reduced by 1 in the second recognition. In general, for a target substring length k, i ranges from 0 to m-k and j ranges from 0 to n-k.
Based on the above description, in some embodiments, the random out-of-order pair identification is continued in a loop identification operation, including:
splitting the substring of the first character string into a front first target substring and a rear first target substring at a specified split position, and splitting the substring of the second character string into a front second target substring and a rear second target substring;
recognizing the substring containing the first target substrings and the substring containing the second target substrings as a random out-of-order pair in either of the following cases:
when the front first target substring and the front second target substring have the same length, the front first target substring and the front second target substring form a random out-of-order pair, and the rear first target substring and the rear second target substring form a random out-of-order pair; or
when the front first target substring and the rear second target substring have the same length, the front first target substring and the rear second target substring form a random out-of-order pair, and the rear first target substring and the front second target substring form a random out-of-order pair.
Specifically, a split position is a position between two adjacent minimum constituent units. The split positions of the first character string and the second character string are the same. That is, when a substring is split, the integrity of each minimum constituent unit must be preserved; a minimum constituent unit cannot be split into two parts. It can be understood that, because the target substrings are shorter than the substrings being compared, whether a first target substring and a second target substring form a random out-of-order pair has already been determined in an earlier recognition pass, and can therefore be read from the saved recognition results. As long as either of the two cases above holds for the first target substrings and second target substrings obtained by a split, the substring containing the first target substrings and the substring containing the second target substrings can be recognized as a random out-of-order pair.
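The loop recognition operation can be sketched as a bottom-up table fill. This is an illustrative sketch under stated assumptions, not the definitive implementation: the function name is hypothetical, the two strings are assumed to have passed pre-recognition (equal length m of at least 1), the specified length is 1, and every split position l from 1 to k-1 is tried. For the exchanged case, the second substring is split so that its rear part has the same length l as the front part of the first substring, which is one reading of the length condition stated above.

```python
def build_recognition_table(a, b):
    """Return dp where dp[i][j][k] == 1 iff a[i:i+k] and b[j:j+k] form a pair."""
    m = len(a)
    assert m == len(b) and m >= 1, "pre-recognition is assumed to have passed"
    dp = [[[0] * (m + 1) for _ in range(m)] for _ in range(m)]
    for i in range(m):                       # first pass, k = 1
        for j in range(m):
            dp[i][j][1] = 1 if a[i] == b[j] else 0
    for k in range(2, m + 1):                # loop recognition, k grows by 1
        for i in range(m - k + 1):
            for j in range(m - k + 1):
                for l in range(1, k):        # split position inside the substring
                    # front/front and rear/rear are pairs (no exchange)
                    same = dp[i][j][l] and dp[i + l][j + l][k - l]
                    # front/rear and rear/front are pairs (exchanged)
                    swapped = dp[i][j + k - l][l] and dp[i + l][j][k - l]
                    if same or swapped:
                        dp[i][j][k] = 1
                        break
    return dp
```

The final answer for the two whole strings is `dp[0][0][m]`. Each entry is computed once from entries with smaller k, so the work is bounded by roughly m^4 table lookups rather than growing with the number of recursive paths.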
For ease of understanding, still taking the first character string "I like playing video games" and the second character string "video games playing I like" as an example, the second random out-of-order pair identification determines, according to the result of the first random out-of-order pair identification, whether dp[0][0][2], dp[0][1][2], dp[0][2][2], dp[0][3][2], dp[1][0][2], dp[1][1][2], dp[1][2][2], dp[1][3][2], ……, dp[3][3][2] are random out-of-order pairs.
Take dp[0][0][2] as an example. dp[0][0][2] indicates whether the substrings "I like" and "video games" are a random out-of-order pair. After the substring "I like" is segmented, the former and latter first target substrings are "I" and "like"; after the substring "video games" is segmented, the former and latter second target substrings are "video" and "games". The first random out-of-order pair identification gives the results shown in Table 1.
Table 1 Identification results of the first random out-of-order pair identification
Based on Table 1, after the substrings "I like" and "video games" are segmented, no first target substring and second target substring form a random out-of-order pair, so dp[0][0][2] = 0.
Next, take dp[0][3][2] as an example. dp[0][3][2] indicates whether the substrings "I like" and "I like" are a random out-of-order pair. The first random out-of-order pair identification gives the results shown in Table 2.
Table 2 Identification results of the first random out-of-order pair identification
Based on Table 2, after the substrings "I like" and "I like" are segmented, the former first target substring "I" and the former second target substring "I" are a random out-of-order pair, and the latter first target substring "like" and the latter second target substring "like" are a random out-of-order pair, so the substrings "I like" and "I like" are a random out-of-order pair, i.e., dp[0][3][2] = 1.
Proceeding in the same way, after the second random out-of-order pair identification is performed on the first character string "I like playing video games" and the second character string "video games playing I like", the identification results may be as follows:
dp[0][0][2]=0,dp[0][1][2]=0,dp[0][2][2]=0,dp[0][3][2]=1,
……
dp[3][3][2]=0
After the second random out-of-order pair identification is completed, the corresponding identification results can be saved. The third random out-of-order pair identification can then continue on the basis of the results of the first and second identifications.
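The first two rounds on the running example can be reproduced with a short sketch. The dictionary `dp` keyed by `(i, j, k)` is an illustrative choice, not a data structure prescribed by the text; it assumes that the minimum constituent unit is a whitespace-separated word and that `i` and `j` are 0-based start positions.

```python
# Rounds 1 and 2 of the pair identification on the running example.
# dp[(i, j, k)] is True when the k units starting at index i of the first
# string and the k units starting at index j of the second string form a pair.
first = "I like playing video games".split()
second = "video games playing I like".split()
n = len(first)

dp = {}

# Round 1 (k = 1): two single units form a pair exactly when they are equal.
for i in range(n):
    for j in range(n):
        dp[(i, j, 1)] = first[i] == second[j]

# Round 2 (k = 2): the only cut is after the first unit, so only the
# saved round-1 results are needed.
for i in range(n - 1):
    for j in range(n - 1):
        same_order = dp[(i, j, 1)] and dp[(i + 1, j + 1, 1)]
        swapped = dp[(i, j + 1, 1)] and dp[(i + 1, j, 1)]
        dp[(i, j, 2)] = same_order or swapped

print(dp[(0, 0, 2)])  # "I like" vs "video games" -> False
print(dp[(0, 3, 2)])  # "I like" vs "I like"      -> True
```

Because the round-1 entries are kept in `dp`, round 2 never compares the same two units twice, which is the point of saving each round's results.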
Further, in the third and subsequent rounds of random out-of-order pair identification, each substring contains multiple candidate segmentation positions, so suitable segmentation positions need to be determined before segmentation in order to avoid invalid segmentations that would increase the amount of data processing. Based on the scrambling steps and the definition of a random out-of-order pair given above, it can be understood that, after a substring is segmented, its front and rear parts are randomly exchanged left and right to obtain another substring that forms a random out-of-order pair with it. There are two cases. In one case, the front and rear parts are not exchanged, and the other substring obtained is the same as the original substring; in the other case, the front and rear parts are exchanged, and the other substring obtained differs from the original substring. For example, if the substring s is segmented into x and y, the other substring may be x+y or y+x. x+y may form a random out-of-order pair with the substring s, and y+x may form another random out-of-order pair with the substring s.
In view of this, refer to Fig. 2 and Fig. 3 together. Fig. 2 is a schematic diagram of the segmentation of a substring according to an embodiment of the present application. Fig. 3 is another schematic diagram of the segmentation of the substring of Fig. 2. In Figs. 2 and 3, the segmentation positions are distinguished by color. After the substring ss is segmented into ssx and ssy, another substring forming a random out-of-order pair with the substring ss may be ssx+ssy or ssy+ssx. Thus, to determine whether the substring st forms a random out-of-order pair with the substring ss, the substring st may be segmented in either of the two manners shown in Figs. 2 and 3.
In summary, in some embodiments, the segmentation position may be determined by the following method:
in the substrings of both the first character string and the second character string, starting from the initial position of the substring, the position separated from the initial position by the segmentation length is determined as the segmentation position; or
in the substring of the first character string, starting from the initial position, the position separated from the initial position by the segmentation length is determined as the segmentation position, and in the substring of the second character string, starting from the end position, the position separated from the end position by the segmentation length is determined as the segmentation position. The segmentation length takes integer values between 1 and k, where k denotes the length of the substring to be segmented.
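The two ways of choosing the segmentation position can be sketched as a small generator. The function name `split_positions` and the 0-based indices are illustrative assumptions; the sketch also assumes that the segmentation length `w` stays below `k`, so that neither part is empty.

```python
def split_positions(i, j, k):
    """Yield (w, cut_first, cut_same, cut_opposite) for each segmentation length w.

    i, j : 0-based start positions of the substrings in the first and second string
    k    : length of the substrings, counted in minimum constituent units
    cut_first    : cut in the first substring, w units after its initial position
    cut_same     : cut in the second substring, w units after its initial position
    cut_opposite : cut in the second substring, w units before its end position
    """
    for w in range(1, k):  # assumption: w < k keeps both parts non-empty
        yield w, i + w, j + w, j + k - w


first = "I like playing video games".split()
second = "video games playing I like".split()
i, j, k = 0, 0, 3
for w, cut_first, cut_same, cut_opposite in split_positions(i, j, k):
    print(first[i:cut_first], first[cut_first:i + k])
    print("  same side:    ", second[j:cut_same], second[cut_same:j + k])
    print("  opposite side:", second[j:cut_opposite], second[cut_opposite:j + k])
```

For a substring of length 3 this produces two segmentation lengths and therefore four candidate segmentations, which limits the work per substring pair to the segmentations that can actually yield a random out-of-order pair.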
For ease of understanding, still taking the first character string "I like playing video games" and the second character string "video games playing I like" as an example, the third random out-of-order pair identification determines, according to the results of the first and second random out-of-order pair identifications, whether dp[0][0][3], dp[0][1][3], dp[0][2][3], ……, dp[2][2][3] are random out-of-order pairs.
Take dp[0][0][3] as an example. dp[0][0][3] indicates whether the substrings "I like playing" and "video games playing" are a random out-of-order pair. The identification process may be as follows:
21) Segment the substring "I like playing" into "I" and "like playing", and segment the substring "video games playing" into "video" and "games playing". Since the former first target substring and the former second target substring have the same length, it is necessary to identify whether the following two groups of target substrings are random out-of-order pairs:
First group: "I" and "video"
Second group: "like playing" and "games playing"
According to the results of the first and second random out-of-order pair identifications, if both groups in step 21) are random out-of-order pairs, the substrings "I like playing" and "video games playing" can be determined to be a random out-of-order pair. Here neither group is a random out-of-order pair, so the subsequent operations continue.
22) Segment the substring "I like playing" into "I" and "like playing", and segment the substring "video games playing" into "video games" and "playing". Since the former first target substring and the latter second target substring have the same length, it is necessary to identify whether the following two groups of target substrings are random out-of-order pairs:
First group: "I" and "playing"
Second group: "like playing" and "video games"
Similarly, if both groups in step 22) are random out-of-order pairs, the substrings "I like playing" and "video games playing" can be determined to be a random out-of-order pair. Here neither group is a random out-of-order pair, so the subsequent operations continue.
23) Segment the substring "I like playing" into "I like" and "playing", and segment the substring "video games playing" into "video games" and "playing". Since the former first target substring and the former second target substring have the same length, it is necessary to identify whether the following two groups of target substrings are random out-of-order pairs:
First group: "I like" and "video games"
Second group: "playing" and "playing"
Similarly, if both groups in step 23) are random out-of-order pairs, the substrings "I like playing" and "video games playing" can be determined to be a random out-of-order pair. Here only the second group is a random out-of-order pair, so the substrings "I like playing" and "video games playing" cannot be determined to be a random out-of-order pair, and the following operation continues.
24) Segment the substring "I like playing" into "I like" and "playing", and segment the substring "video games playing" into "video" and "games playing". Since the former first target substring and the latter second target substring have the same length, it is necessary to identify whether the following two groups of target substrings are random out-of-order pairs:
First group: "I like" and "games playing"
Second group: "playing" and "video"
Similarly, if both groups in step 24) are random out-of-order pairs, the substrings "I like playing" and "video games playing" can be determined to be a random out-of-order pair. Here neither group is a random out-of-order pair.
Based on the identification in steps 21) to 24), it can be determined that the substrings "I like playing" and "video games playing" are not a random out-of-order pair, i.e., dp[0][0][3] = False.
Proceeding in the same way, after the third random out-of-order pair identification is performed on the first character string "I like playing video games" and the second character string "video games playing I like", the identification results may be as follows:
dp[0][0][3]=False, dp[0][1][3]=False, dp[0][2][3]=1, ……, dp[2][2][3]=False.
And so on, the fourth and fifth random out-of-order pair identifications are performed, until the target substring length k is greater than the length of the first character string or the second character string.
It can be understood that the result obtained in the last random out-of-order pair identification is the result of whether the first character string and the second character string are a random out-of-order pair. For example, after the fifth random out-of-order pair identification is performed on the first character string "I like playing video games" and the second character string "video games playing I like", the result is dp[0][0][5] = 1, which indicates that the first character string "I like playing video games" and the second character string "video games playing I like" are a random out-of-order pair.
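The complete sequence of rounds, from length 1 up to the full string length, can be sketched as a bottom-up table fill. The function name `scramble_table`, the dictionary layout, and the word-level tokenization are assumptions made for illustration; the specified length added in each round is assumed to be 1.

```python
def scramble_table(first, second):
    """Fill dp[(i, j, k)] for every start pair (i, j) and every length k.

    first, second: equal-length lists of minimum constituent units.
    dp[(i, j, k)] is True when first[i:i+k] and second[j:j+k] are a
    random out-of-order pair.
    """
    n = len(first)
    if n != len(second):
        raise ValueError("both strings must contain the same number of units")
    dp = {}
    # Round 1: single units are a pair exactly when they are equal.
    for i in range(n):
        for j in range(n):
            dp[(i, j, 1)] = first[i] == second[j]
    # Later rounds: increase the target substring length by 1 each time and
    # reuse the results saved by all earlier rounds.
    for k in range(2, n + 1):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                dp[(i, j, k)] = any(
                    (dp[(i, j, w)] and dp[(i + w, j + w, k - w)])
                    or (dp[(i, j + k - w, w)] and dp[(i + w, j, k - w)])
                    for w in range(1, k)
                )
    return dp


first = "I like playing video games".split()
second = "video games playing I like".split()
dp = scramble_table(first, second)
print(dp[(0, 0, 3)], dp[(0, 2, 3)], dp[(0, 0, 5)])  # False True True
```

The entry for the full length, here `dp[(0, 0, 5)]`, is the final identification result for the two character strings.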
Step S14: when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are a random out-of-order pair, labeling the first character string and the second character string with the same semantic information.
Specifically, when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are a random out-of-order pair, the first character string and the second character string are labeled with the same semantic information, namely the semantic information of the character string to be recognized; when the finally obtained identification result indicates that they are not a random out-of-order pair, the first character string and the second character string are labeled with different semantic information. This facilitates subsequent model recognition.
In summary, in the technical solutions of some embodiments of the present application, random out-of-order pairs are identified in the first character string and the second character string to be labeled in order of substring length from short to long, finally yielding the identification result of whether the first character string and the second character string are a random out-of-order pair, according to which the two character strings are labeled. Because the identification proceeds bottom-up using a dynamic programming algorithm, the identification time for random out-of-order pairs can be greatly shortened, which in turn improves labeling efficiency. By contrast, some techniques identify random out-of-order pairs with a brute-force recursive search. Such a method may suffer from repeated identification, for example identifying the same character string A and character string B multiple times, which makes the overall identification process extremely time-consuming. In the present application, identifying random out-of-order pairs with a dynamic programming algorithm avoids this repeated identification, greatly shortens the identification time, and thereby improves labeling efficiency.
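The labeling step can be illustrated with a minimal sketch. The function name `annotate` and the integer semantic ids 0 and 1 are hypothetical; the text does not specify how semantic information is represented.

```python
def annotate(first, second, is_pair):
    """Return (text, semantic_id) records for the two character strings.

    Hypothetical encoding: both strings share semantic id 0 when they are
    a random out-of-order pair; otherwise the second string gets id 1.
    """
    second_id = 0 if is_pair else 1
    return [(first, 0), (second, second_id)]


print(annotate("I like playing video games", "video games playing I like", True))
```

Records produced this way could then be fed to a model as training samples with matching or differing semantic labels.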
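The repeated identification mentioned above can be made visible with a small sketch of a brute-force recursion. The function `brute_force` and the `calls` log are illustrative assumptions and do not reproduce any particular prior technique; the sketch only records every pair of substrings the recursion examines.

```python
def brute_force(a, b, calls):
    """Top-down recursion on two equal-length tuples of units.

    Every examined (a, b) pair is appended to `calls`, so repeated work
    shows up as duplicate entries in the log.
    """
    calls.append((a, b))
    if len(a) == 1:
        return a == b
    for w in range(1, len(a)):
        # Front and rear parts kept in the same order.
        if brute_force(a[:w], b[:w], calls) and brute_force(a[w:], b[w:], calls):
            return True
        # Front and rear parts exchanged left and right.
        if brute_force(a[:w], b[-w:], calls) and brute_force(a[w:], b[:-w], calls):
            return True
    return False


first = tuple("I like playing video games".split())
second = tuple("video games playing I like".split())
calls = []
result = brute_force(first, second, calls)
print(result, len(calls), len(set(calls)))
```

On this example the log contains more entries than distinct pairs, for instance ("I",) and ("video",) are compared more than once, whereas a bottom-up table computes each entry a single time.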
Referring to Fig. 4, a schematic block diagram of a labeling device according to an embodiment of the present application is shown. The labeling device comprises:
a receiving module, configured to receive the first character string and the second character string to be labeled;
an initial identification module, configured to perform a first random out-of-order pair identification on the substrings segmented from the first character string according to the target substring length and the substrings segmented from the second character string according to the target substring length, to obtain a corresponding identification result;
a loop identification module, configured to perform a loop identification operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop identification operation is: adding a specified length to the current target substring length to obtain a new target substring length, and, according to the identification results already obtained, continuing random out-of-order pair identification on the substrings segmented from the first character string according to the new target substring length and the substrings segmented from the second character string according to the new target substring length, to obtain a corresponding identification result; and
a labeling module, configured to label the first character string and the second character string with the same semantic information when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are a random out-of-order pair.
Referring to fig. 5, a schematic diagram of an electronic device according to an embodiment of the application is provided. The electronic device comprises a processor and a memory for storing a computer program which, when executed by the processor, implements the method described above.
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules, corresponding to the methods in embodiments of the present application. The processor executes various functional applications of the processor and data processing, i.e., implements the methods of the method embodiments described above, by running non-transitory software programs, instructions, and modules stored in memory.
The memory may include a memory program area and a memory data area, wherein the memory program area may store an operating system, at least one application program required for a function; the storage data area may store data created by the processor, etc. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method.
Although embodiments of the present application have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the application, and such modifications and variations fall within the scope of the application as defined by the appended claims.

Claims (10)

1. A method of labeling, the method comprising:
receiving a first character string and a second character string to be marked;
performing a first random out-of-order pair identification on the substrings segmented from the first character string according to a target substring length and the substrings segmented from the second character string according to the target substring length, to obtain a corresponding identification result, wherein a random out-of-order pair denotes a pair of two character strings in which one character string is obtained by randomly exchanging substrings of the other character string left and right;
performing a loop identification operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop identification operation is: adding a specified length to the current target substring length to obtain a new target substring length; and, according to the identification results already obtained, continuing random out-of-order pair identification on the substrings segmented from the first character string according to the new target substring length and the substrings segmented from the second character string according to the new target substring length, to obtain a corresponding identification result;
and labeling the first character string and the second character string with the same semantic information when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are a random out-of-order pair.
2. The method of claim 1, wherein the performing a first random out-of-order pair identification comprises:
comparing the substring of the first character string with the substring of the second character string, identifying the same substring pair as a random disorder pair, and identifying the different substring pair as a non-random disorder pair.
3. The method of claim 1, wherein continuing random out-of-order pair identification in the loop identification operation comprises:
segmenting, at a specified segmentation position, the substring in the first character string into a former first target substring and a latter first target substring, and segmenting the substring in the second character string into a former second target substring and a latter second target substring;
the substring in which the first target substrings are located and the substring in which the second target substrings are located are identified as a random out-of-order pair in either of the following cases:
the former first target substring and the former second target substring have the same length, the former first target substring and the former second target substring are a random out-of-order pair, and the latter first target substring and the latter second target substring are a random out-of-order pair; or
the former first target substring and the latter second target substring have the same length, the former first target substring and the latter second target substring are a random out-of-order pair, and the latter first target substring and the former second target substring are a random out-of-order pair.
4. The method of claim 3, wherein the segmentation position is determined by the following method:
in the substrings of both the first character string and the second character string, starting from the initial position of the substring, determining the position separated from the initial position by the segmentation length as the segmentation position; or
in the substring of the first character string, starting from the initial position of the substring, determining the position separated from the initial position by the segmentation length as the segmentation position, and in the substring of the second character string, starting from the end position of the substring, determining the position separated from the end position by the segmentation length as the segmentation position.
5. The method of claim 4, wherein the value of the segmentation length comprises an integer value between 1 and k, where k represents the length of the substring to be segmented.
6. The method of claim 1, wherein the first string and the second string each comprise a minimum constituent unit;
prior to performing the random out-of-order pair identification, the method further comprises:
comparing the lengths of the first character string and the second character string;
dividing the same minimum constituent units in the first character string into one type and dividing the same minimum constituent units in the second character string into one type under the condition that the first character string and the second character string have the same length;
the random out-of-order pair identification is performed when the first character string and the second character string have the same minimum constituent unit category and the minimum constituent unit number in each category is the same.
7. The method of claim 1, wherein when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are not a random out-of-order pair, the first character string and the second character string are labeled with different semantic information.
8. A labeling device, the device comprising:
a receiving module, configured to receive the first character string and the second character string to be labeled;
an initial identification module, configured to perform a first random out-of-order pair identification on the substrings segmented from the first character string according to a target substring length and the substrings segmented from the second character string according to the target substring length, to obtain a corresponding identification result, wherein a random out-of-order pair denotes a pair of two character strings in which one character string is obtained by randomly exchanging substrings of the other character string left and right;
a loop identification module, configured to perform a loop identification operation until the target substring length is greater than the length of the first character string or the second character string, wherein the loop identification operation is: adding a specified length to the current target substring length to obtain a new target substring length, and, according to the identification results already obtained, continuing random out-of-order pair identification on the substrings segmented from the first character string according to the new target substring length and the substrings segmented from the second character string according to the new target substring length, to obtain a corresponding identification result; and
a labeling module, configured to label the first character string and the second character string with the same semantic information when the finally obtained identification result indicates that the substring in the first character string and the substring in the second character string are a random out-of-order pair.
9. A computer readable storage medium for storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An electronic device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the method of any of claims 1 to 7.
CN202310771543.2A 2023-06-28 2023-06-28 Labeling method, labeling device, equipment and readable storage medium Active CN116502611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310771543.2A CN116502611B (en) 2023-06-28 2023-06-28 Labeling method, labeling device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310771543.2A CN116502611B (en) 2023-06-28 2023-06-28 Labeling method, labeling device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116502611A CN116502611A (en) 2023-07-28
CN116502611B true CN116502611B (en) 2023-12-05

Family

ID=87325252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310771543.2A Active CN116502611B (en) 2023-06-28 2023-06-28 Labeling method, labeling device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116502611B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484391A (en) * 2014-12-11 2015-04-01 北京国双科技有限公司 Method and device for calculating similarity of character strings
CN104796354A (en) * 2014-11-19 2015-07-22 中国科学院信息工程研究所 Out-of-order data packet string matching method and system
CN106033416A (en) * 2015-03-09 2016-10-19 阿里巴巴集团控股有限公司 A string processing method and device
CN114692594A (en) * 2022-04-18 2022-07-01 上海喜马拉雅科技有限公司 Text similarity recognition method and device, electronic equipment and readable storage medium
CN115204889A (en) * 2021-04-07 2022-10-18 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN116150333A (en) * 2022-11-14 2023-05-23 马上消费金融股份有限公司 Text matching method, device, electronic equipment and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2456369A (en) * 2008-01-11 2009-07-15 Ibm String pattern analysis for word or genome analysis
US9053050B2 (en) * 2011-12-02 2015-06-09 Synopsys, Inc. Determining a desirable number of segments for a multi-segment single error correcting coding scheme

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于有效子串标注的中文分词;赵海;揭春雨;;中文信息学报(第05期);第10-15页 *

Also Published As

Publication number Publication date
CN116502611A (en) 2023-07-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant