WO2020098099A1 - Text accuracy calculation method and apparatus based on semantic parsing, and computer device - Google Patents

Text accuracy calculation method and apparatus based on semantic parsing, and computer device

Info

Publication number
WO2020098099A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
distance matrix
characters
value
edit distance
Prior art date
Application number
PCT/CN2018/124399
Other languages
French (fr)
Chinese (zh)
Inventor
吴建财
邹芳
邢艳
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020098099A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the technical field of semantic parsing, and in particular to a method, an apparatus, and a computer device for calculating text accuracy based on semantic parsing.
  • A commonly used algorithm for measuring transfer accuracy is the edit distance algorithm.
  • The algorithm calculates the similarity (transfer accuracy) between the transferred text and the template text by counting the minimum number of edit operations, where the edit operations are: replacing one character with another, inserting a character, and deleting a character.
  • The results of this algorithm are unsatisfactory, however. Because it always compares the transferred text with the entire template text, it cannot accurately calculate the transfer accuracy when only part of the text has been transferred. The edit distance is therefore unsuitable for scenarios concerned with the real-time transfer accuracy of an ASR engine.
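  • The limitation described above can be illustrated with a minimal sketch. The function names `edit_distance` and `accuracy` are hypothetical, and the accuracy formula (one minus the distance divided by the longer length) is an assumption, since the text only says that accuracy is derived from the minimum number of edit operations.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy(transferred, template):
    """Assumed accuracy measure: 1 - distance / longer length."""
    longest = max(len(transferred), len(template))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(transferred, template) / longest


# A perfectly transferred fragment still scores poorly against the whole template.
print(accuracy("cdef", "abcdefgh"))  # fragment vs. entire template
print(accuracy("cdef", "cdef"))      # fragment vs. the matching span only
```

  Under these assumptions the fragment is penalised for every template character that has not yet been transferred, which is the problem the application sets out to address.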
  • This application proposes a method, an apparatus, and a computer device for calculating text accuracy based on semantic analysis, aiming to solve the problem that existing text transfer accuracy algorithms compare the transferred text with the entire template text and therefore cannot accurately calculate the transfer accuracy when only part of the text has been transferred.
  • A text accuracy calculation method based on semantic analysis includes:
  • establishing an edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transferred text plus two;
  • according to the first trajectory, determining the start point and end point of the partially transferred text on the template text, to obtain the first start point and the first end point;
  • the present application also provides a text accuracy calculation device based on semantic analysis.
  • the device includes:
  • a first obtaining module, configured to obtain the partially transferred text whose transfer starts from any position of the template text other than the starting point;
  • a first calculation module, configured to calculate the value of each element in the edit distance matrix according to the partially transferred text and the template text;
  • a generating module, configured to record the calculation trajectory of the value of each element in the edit distance matrix, and generate a trajectory matrix corresponding to the edit distance matrix;
  • a screening module, configured to calculate the similarity of each trajectory in the trajectory matrix and select the trajectory with the highest similarity between the partially transferred text and the template text, to obtain a first trajectory;
  • an obtaining module, configured to determine a start point and an end point of the partially transferred text on the template text according to the first trajectory, to obtain a first start point and a first end point;
  • a second obtaining module, configured to obtain a new template text from the template text according to the first start point and the first end point;
  • a second calculation module, configured to compare the partially transferred text with the new template text, and calculate the accuracy of the partially transferred text through an edit distance algorithm.
  • the present application also provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the steps of any one of the above methods are implemented.
  • The present application also provides a non-volatile computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the above methods are implemented.
  • The present application has the following beneficial effect. When transfer of the template text starts at any position other than the starting point, an edit distance matrix is established and the value of each element in it is calculated; a trajectory matrix is generated from the calculation trajectory of each element's value; the similarity of each trajectory in the trajectory matrix is calculated and the trajectory with the highest similarity is selected as the first trajectory; the start point and end point of the partially transferred text on the template text are obtained from the first trajectory, which yields the new template text; and the partially transferred text is then compared with the new template text to calculate its accuracy. This solves the problem that existing text transfer accuracy algorithms compare the transferred text with the entire template text and therefore cannot accurately calculate the transfer accuracy when only part of the text has been transferred.
  • FIG. 1 is a flowchart of a text accuracy calculation method based on semantic analysis provided by an embodiment of the present application;
  • FIG. 2 is a functional block diagram of a text accuracy calculation device based on semantic analysis provided by an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • an embodiment of the present application provides a method for calculating text accuracy based on semantic analysis.
  • the method includes the following steps:
  • Step S101: Acquire the partially transferred text, whose transfer starts from any position of the template text other than the starting point.
  • The template text is transferred starting from any position other than its starting point, and it is not transferred in full; that is, the transfer starts from any character of the template text other than the first. When the transfer starts from a non-first character of the template text, its end point is the character at which the transfer started or any character after it in the template text.
  • The template text is the correct text, against which the partially transferred text is compared.
  • The transfer mentioned above refers to transcribing speech into text through an ASR (speech recognition) engine.
  • Step S102: Establish an edit distance matrix whose number of columns is the character length of the template text plus two, and whose number of rows is the character length of the partially transferred text plus two.
  • The template text is text with punctuation marks removed.
  • The partially transferred text is likewise text with punctuation marks removed.
  • The character length of the partially transferred text is obtained and increased by two to give the number of rows; the character length of the template text is obtained and increased by two to give the number of columns; the edit distance matrix is then established with these dimensions.
  • The purpose of the two extra rows and columns is that the first row and the first column hold the characters of the template text and of the partially transferred text, respectively, while the second row and the second column hold the initial values.
  • After step S102 and before step S103, the method includes:
  • sequentially incrementing by 1, starting from the value of the second element in the second column of the edit distance matrix, to initialize the value of each element in the second column of the edit distance matrix.
  • The characters of the template text are input in the first row of the edit distance matrix; specifically, they are input starting from the third element of the first row.
  • The characters of the partially transferred text are input in the first column of the edit distance matrix; specifically, they are input starting from the third element of the first column.
  • Inputting the characters of the template text and of the partially transferred text from the third element of the first row and of the first column gives every character a corresponding position in the edit distance matrix, and also provides a positional reference for the initial values in the second row and the second column.
  • The value of the second element in the second row of the edit distance matrix is defined as 0.
  • Starting from this 0, the values are incremented by 1 in sequence to initialize the elements of the second row; for example, the second, third, fourth, and fifth elements in the second row of the edit distance matrix are 0, 1, 2, and 3, respectively.
  • Because the second element in the second row and the second element in the second column are the same element, the value of the second element in the second column is also 0. Starting from this 0, the values are incremented by 1 in sequence to initialize the elements of the second column; for example, the second, third, fourth, and fifth elements in the second column of the edit distance matrix are 0, 1, 2, and 3, respectively. Once the second row and the second column have been initialized, the values of the remaining elements in the edit distance matrix can be calculated.
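  • The layout and initialization described above can be sketched as follows. The function name `build_matrix` and the use of `None` for cells that have not been filled yet are assumptions made for illustration; the text only fixes the dimensions, the position of the characters, and the initial values.

```python
def build_matrix(template, transferred):
    """Build the (len(transferred) + 2) x (len(template) + 2) matrix with labels and initial values."""
    rows = len(transferred) + 2
    cols = len(template) + 2
    matrix = [[None] * cols for _ in range(rows)]

    # First row, from the third element onward: characters of the template text.
    for j, ch in enumerate(template):
        matrix[0][j + 2] = ch
    # First column, from the third element onward: characters of the partially transferred text.
    for i, ch in enumerate(transferred):
        matrix[i + 2][0] = ch

    # Second row and second column: 0, 1, 2, ... starting at their shared second element.
    for j in range(1, cols):
        matrix[1][j] = j - 1
    for i in range(1, rows):
        matrix[i][1] = i - 1
    return matrix


for row in build_matrix("abc", "bc"):
    print(row)
```

  With a three-character template and a two-character partially transferred text, this produces a 4 x 5 matrix whose first data cell to compute is the third element of the third column.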
  • Step S103: Calculate the value of each element in the edit distance matrix according to the partially transferred text and the template text.
  • The method for calculating the value of each element in the edit distance matrix is determined first, and the value of each element in the edit distance matrix is then calculated.
  • The value of each uninitialized element in the edit distance matrix is determined by the value of one of the elements to its left, to its upper left, and above it.
  • Step S103 includes:
  • if the template-text character corresponding to the column of the third element in the third column of the edit distance matrix equals the partially-transferred-text character corresponding to its row, the value of the third element in the third column is the value of the element at its upper left;
  • if the two characters are not equal, the value of the third element in the third column of the edit distance matrix is the minimum of the values of its left, upper-left, and upper elements, plus 1;
  • Because each uninitialized element in the edit distance matrix is determined by the value of one of its left, upper-left, and upper elements, at the start of the calculation the only element whose left, upper-left, and upper neighbors all have values is the third element in the third column of the edit distance matrix.
  • The third element in the third column is therefore calculated first. The column and the row in which this element lies are identified, and then the template-text character corresponding to that column and the partially-transferred-text character corresponding to that row are identified.
  • Whether these two characters are equal is then judged, and the result determines the value of the third element in the third column of the edit distance matrix.
  • If the two characters are equal, the value of the third element in the third column of the edit distance matrix is the value of the element at its upper left.
  • If they are not equal, the value of the third element in the third column of the edit distance matrix is the minimum of the values of its left, upper-left, and upper elements, plus 1.
  • The value of the fourth element in the third column is calculated next, and so on until every element in the third column has a value; the elements of the fourth column are then calculated in the same way, and the process continues until the last column is complete, at which point the value of every element in the edit distance matrix has been calculated.
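  • The fill rule and the column-by-column order described above can be sketched as below. For brevity the sketch keeps only the numeric part of the matrix (the label row and label column are omitted, so index 0 corresponds to the initialized second row and second column); the function name `fill_edit_distance` is hypothetical.

```python
def fill_edit_distance(template, transferred):
    """Fill the numeric part of the edit distance matrix, column by column."""
    m, n = len(transferred), len(template)
    values = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        values[0][j] = j          # initialized row: 0, 1, 2, ...
    for i in range(m + 1):
        values[i][0] = i          # initialized column: 0, 1, 2, ...

    for j in range(1, n + 1):     # one column at a time
        for i in range(1, m + 1): # top to bottom within the column
            if transferred[i - 1] == template[j - 1]:
                # Equal characters: copy the upper-left value.
                values[i][j] = values[i - 1][j - 1]
            else:
                # Unequal characters: minimum of left, upper-left, and upper, plus 1.
                values[i][j] = min(values[i][j - 1],
                                   values[i - 1][j - 1],
                                   values[i - 1][j]) + 1
    return values


for row in fill_edit_distance("abc", "bc"):
    print(row)
```

  The bottom-right value of the returned table is the edit distance between the two full strings.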
  • Step S104: Record the calculation trajectory of the value of each element in the edit distance matrix, and generate a trajectory matrix corresponding to the edit distance matrix.
  • Step S104 includes:
  • In one implementation, the step of marking, in the calculation trajectory, the origin of the value of each element in the edit distance matrix includes:
  • marking the origin of an element's value as soon as that value has been calculated; that is, the origin of each value is marked while the calculation trajectory of the values in the edit distance matrix is being recorded.
  • In another implementation, the step of marking, in the calculation trajectory, the origin of the value of each element in the edit distance matrix includes:
  • marking the origin of the value of each element according to the recorded calculation trajectory of the values in the edit distance matrix.
  • In this case, marking is triggered only after the calculation trajectory of every element's value has been recorded, and it continues until the origin of every value has been marked; while the recording of the calculation trajectory is still incomplete, marking is not triggered.
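  • As an illustration of the second variant (marking origins only after all values are known), the sketch below derives a trajectory matrix from an already computed table of values. The labels `"diag"`, `"left"`, and `"up"`, the function name `mark_origins`, and the tie-breaking order (upper-left first, then left, then upper) are assumptions; the text does not specify how origins are encoded or how ties are resolved.

```python
def mark_origins(values, template, transferred):
    """Derive, for every cell, which neighbor its value came from."""
    m, n = len(transferred), len(template)
    origins = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        origins[0][j] = "left"    # initialized row grows from the left
    for i in range(1, m + 1):
        origins[i][0] = "up"      # initialized column grows from above

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if transferred[i - 1] == template[j - 1]:
                origins[i][j] = "diag"
            else:
                candidates = [(values[i - 1][j - 1], "diag"),
                              (values[i][j - 1], "left"),
                              (values[i - 1][j], "up")]
                # min() returns the first smallest candidate, which fixes the tie order.
                origins[i][j] = min(candidates, key=lambda c: c[0])[1]
    return origins


# Values for template "abc" and partially transferred text "bc".
values = [[0, 1, 2, 3], [1, 1, 1, 2], [2, 2, 2, 1]]
for row in mark_origins(values, "abc", "bc"):
    print(row)
```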
  • Step S105: Calculate the similarity of each trajectory in the trajectory matrix, and select the trajectory with the highest similarity between the partially transferred text and the template text, to obtain a first trajectory.
  • After the trajectory matrix is generated, the similarity of each trajectory in it is calculated. The trajectory with the highest similarity between the partially transferred text and the template text is then selected as the first trajectory, which is taken to be the trajectory of the partially transferred text on the template text.
  • The step of calculating the similarity of each trajectory in the trajectory matrix includes:
  • calculating, for each trajectory in the trajectory matrix, the ratio of the number of equal characters to the corresponding total number of characters, to obtain the similarity of each trajectory.
  • After the trajectory matrix is generated, the characters of the partially transferred text that equal the corresponding characters of the template text are counted for each trajectory, giving the number of equal characters. Then, for each trajectory, the character length of the partially transferred text is compared with the character length of the corresponding template text, and the longer of the two is taken as the total number of characters: if the partially transferred text is longer, its length is the total number of characters; if it is shorter, the length of the corresponding template text is the total number of characters. Finally, the ratio of the number of equal characters to the total number of characters is calculated for each trajectory, and this ratio is the similarity of that trajectory.
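  • The ratio described above is small enough to state directly in code. The function name `trajectory_similarity` and the choice to return 0.0 when both lengths are zero are assumptions.

```python
def trajectory_similarity(equal_chars, transferred_len, template_len):
    """Number of equal characters divided by the longer of the two lengths."""
    total = max(transferred_len, template_len)
    if total == 0:
        return 0.0  # assumption: an empty trajectory has no similarity
    return equal_chars / total


# Three equal characters, 4 transferred characters, 5 template characters on the trajectory.
print(trajectory_similarity(3, 4, 5))
```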
  • Step S106: According to the first trajectory, determine a start point and an end point of the partially transferred text on the template text, to obtain a first start point and a first end point.
  • According to the first trajectory, the start point and the end point corresponding to the partially transferred text on the template text are determined, thereby obtaining the first start point and the first end point.
  • Step S106 includes:
  • Step S107: Acquire a new template text from the template text according to the first start point and the first end point.
  • The text between the two points is taken from the template text, including the characters corresponding to the first start point and the first end point themselves, to obtain the new template text.
  • Step S107 includes:
  • intercepting the characters between the first start point and the first end point, where these characters include the character corresponding to the first start point and the character corresponding to the first end point;
  • The characters between the first start point and the first end point of the template text are intercepted, and they include the characters of the template text at the first start point and at the first end point.
  • Text in the same format as the template text is then generated from the intercepted characters, to obtain the new template text.
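  • Because both end characters are included, the interception is an inclusive slice. The sketch below assumes 0-based character indices for the first start point and the first end point, and the function name `new_template_text` is hypothetical.

```python
def new_template_text(template, first_start, first_end):
    """Return the template characters from first_start to first_end, both inclusive."""
    return template[first_start:first_end + 1]


print(new_template_text("abcdefgh", 2, 5))  # characters at indices 2, 3, 4, 5
```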
  • Step S108: Compare the partially transferred text with the new template text, and calculate the accuracy of the partially transferred text through an edit distance algorithm.
  • The partially transferred text is compared with the new template text, not with the original template text.
  • The accuracy of the partially transferred text is then calculated by the edit distance algorithm. This solves the problem that existing text transfer accuracy algorithms compare the transferred text with the entire template text and therefore cannot accurately calculate the transfer accuracy when only part of the text has been transferred.
  • In summary, an edit distance matrix is established and the value of each element in it is calculated; a trajectory matrix is generated from the calculation trajectory of each element's value; the similarity of each trajectory in the trajectory matrix is calculated and the trajectory with the highest similarity is selected as the first trajectory; and the start point and end point of the partially transferred text on the template text are obtained from the first trajectory, which yields the new template text.
  • The partially transferred text is then compared with the new template text to calculate its accuracy. This solves the problem that existing text transfer accuracy algorithms compare the transferred text with the entire template text and therefore cannot accurately calculate the transfer accuracy when only part of the text has been transferred.
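  • The steps S103 to S108 can be combined into one hedged sketch. Several details are assumptions because the text does not spell them out: one trajectory is traced per cell of the last row, each trajectory is followed back until it reaches the initialized row, ties between neighbors are broken in the order upper-left, left, upper, and accuracy is taken as one minus the edit distance divided by the longer length. All function names are hypothetical.

```python
def fill(template, transferred):
    """Return (values, origins) for the numeric part of the edit distance matrix."""
    m, n = len(transferred), len(template)
    values = [[0] * (n + 1) for _ in range(m + 1)]
    origins = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        values[0][j], origins[0][j] = j, "left"
    for i in range(1, m + 1):
        values[i][0], origins[i][0] = i, "up"
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if transferred[i - 1] == template[j - 1]:
                values[i][j], origins[i][j] = values[i - 1][j - 1], "diag"
            else:
                candidates = [(values[i - 1][j - 1], "diag"),
                              (values[i][j - 1], "left"),
                              (values[i - 1][j], "up")]
                best, origin = min(candidates, key=lambda c: c[0])
                values[i][j], origins[i][j] = best + 1, origin
    return values, origins


def locate(template, transferred):
    """Return the inclusive 0-based (start, end) of the most similar trajectory."""
    _, origins = fill(template, transferred)
    m = len(transferred)
    best = (-1.0, 0, 0)
    for end_col in range(1, len(template) + 1):
        i, j, equal = m, end_col, 0
        while i > 0:  # trace back until the initialized row is reached
            move = origins[i][j]
            if move == "diag":
                equal += transferred[i - 1] == template[j - 1]
                i, j = i - 1, j - 1
            elif move == "left":
                j -= 1
            else:
                i -= 1
        total = max(m, end_col - j)  # longer of the two lengths on this trajectory
        similarity = equal / total if total else 0.0
        if similarity > best[0]:
            best = (similarity, j, end_col - 1)
    return best[1], best[2]


def partial_accuracy(template, transferred):
    """Accuracy of the partially transferred text against the new template text."""
    start, end = locate(template, transferred)
    new_template = template[start:end + 1]
    values, _ = fill(new_template, transferred)
    longest = max(len(new_template), len(transferred))
    return 1.0 - values[-1][-1] / longest if longest else 1.0


print(locate("abcdefgh", "cdxf"), partial_accuracy("abcdefgh", "cdxf"))
```

  In this sketch a fragment with one wrong character is scored against the four-character span it corresponds to, not against the eight-character template.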
  • An embodiment of the present application proposes a device 1 for calculating text accuracy based on semantic analysis.
  • The device 1 includes a first acquiring module 11, an establishing module 12, a first calculating module 13, a generating module 14, a filtering module 15, an acquisition module 16, a second acquisition module 17, and a second calculation module 18.
  • The first acquiring module 11 is configured to acquire the partially transferred text, whose transfer starts from any position of the template text other than the starting point.
  • The template text is transferred starting from any position other than its starting point, and it is not transferred in full; that is, the transfer starts from any character of the template text other than the first. When the transfer starts from a non-first character of the template text, its end point is the character at which the transfer started or any character after it in the template text.
  • The template text is the correct text, against which the partially transferred text is compared.
  • The transfer mentioned above refers to transcribing speech into text through an ASR (speech recognition) engine.
  • The establishing module 12 is configured to establish an edit distance matrix whose number of columns is the character length of the template text plus two, and whose number of rows is the character length of the partially transferred text plus two.
  • The template text is text with punctuation marks removed.
  • The partially transferred text is likewise text with punctuation marks removed.
  • The character length of the partially transferred text is obtained and increased by two to give the number of rows; the character length of the template text is obtained and increased by two to give the number of columns; the edit distance matrix is then established with these dimensions.
  • The purpose of the two extra rows and columns is that the first row and the first column hold the characters of the template text and of the partially transferred text, respectively, while the second row and the second column hold the initial values.
  • The device 1 includes:
  • a first input module, configured to input the characters of the template text starting from the third element of the first row of the edit distance matrix;
  • a second input module, configured to input the characters of the partially transferred text starting from the third element of the first column of the edit distance matrix;
  • a definition module, configured to define the value of the second element in the second row of the edit distance matrix as 0;
  • a first initialization module, configured to sequentially increment by 1, starting from the value of the second element in the second row of the edit distance matrix, to initialize the value of each element in the second row of the edit distance matrix;
  • a second initialization module, configured to sequentially increment by 1, starting from the value of the second element in the second column of the edit distance matrix, to initialize the value of each element in the second column of the edit distance matrix.
  • The characters of the template text are input in the first row of the edit distance matrix; specifically, they are input starting from the third element of the first row.
  • The characters of the partially transferred text are input in the first column of the edit distance matrix; specifically, they are input starting from the third element of the first column.
  • Inputting the characters of the template text and of the partially transferred text from the third element of the first row and of the first column gives every character a corresponding position in the edit distance matrix, and also provides a positional reference for the initial values in the second row and the second column.
  • The value of the second element in the second row of the edit distance matrix is defined as 0.
  • Starting from this 0, the values are incremented by 1 in sequence to initialize the elements of the second row; for example, the second, third, fourth, and fifth elements in the second row of the edit distance matrix are 0, 1, 2, and 3, respectively.
  • Because the second element in the second row and the second element in the second column are the same element, the value of the second element in the second column is also 0. Starting from this 0, the values are incremented by 1 in sequence to initialize the elements of the second column; for example, the second, third, fourth, and fifth elements in the second column of the edit distance matrix are 0, 1, 2, and 3, respectively. Once the second row and the second column have been initialized, the values of the remaining elements in the edit distance matrix can be calculated.
  • The first calculation module 13 is configured to calculate the value of each element in the edit distance matrix according to the partially transferred text and the template text.
  • The method for calculating the value of each element in the edit distance matrix is determined first, and the value of each element in the edit distance matrix is then calculated.
  • The first calculation module 13 includes:
  • a first identification module, configured to identify the column and the row in which the third element in the third column of the edit distance matrix lies;
  • a second identification module, configured to identify the template-text character corresponding to that column and the partially-transferred-text character corresponding to that row;
  • a first judgment module, configured to judge whether the template-text character corresponding to the column of the third element in the third column of the edit distance matrix equals the partially-transferred-text character corresponding to its row; if the two characters are equal, the value of the third element in the third column of the edit distance matrix is the value of the element at its upper left; if they are not equal, the value of the third element in the third column of the edit distance matrix is the minimum of the values of its left, upper-left, and upper elements, plus 1;
  • a first sub-calculation module, configured to calculate, in sequence, the value of the fourth element in the third column of the edit distance matrix and of each subsequent element, until the value of every element in the edit distance matrix has been calculated.
  • since the value of each element in the edit distance matrix that has not been initialized is determined by the value of one of its left, upper-left, and upper elements, at the beginning of the calculation the only element whose left, upper-left, and upper elements all have values is the third element in the third column of the edit distance matrix, which is also the third element in the third row.
  • the third element in the third column of the edit distance matrix is therefore calculated first: the column and the row in which this element is located are identified, and then the character of the template text corresponding to that column and the character of the partially transcribed text corresponding to that row are identified.
  • it is then judged whether the character of the template text corresponding to the column of this element is equal to the character of the partially transcribed text corresponding to its row; the result of this comparison determines how the value of the third element in the third column of the edit distance matrix is obtained.
  • if the two characters are equal, the value of the third element in the third column of the edit distance matrix is the value of the element at its upper left corner.
  • if the two characters are not equal, the value of the third element in the third column of the edit distance matrix is the minimum of the values of its left, upper-left, and upper elements plus 1.
  • the value of the fourth element in the third column of the edit distance matrix is then calculated, and so on in turn: after the value of every element in the third column has been calculated, the value of every element in the fourth column is calculated, and this continues until the value of every element in the last column has been calculated, at which point the calculation of the value of every element in the edit distance matrix is complete.
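The column-by-column calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the function name `fill_edit_distance_matrix` and the use of `None` for the unused corner cells are hypothetical choices made for the example. Indices are 0-based, so the "third element in the third column" is `m[2][2]`.

```python
def fill_edit_distance_matrix(template, transcript):
    """Build the labelled matrix and fill it column by column (illustrative sketch)."""
    n_cols = len(template) + 2      # template length plus two extra columns
    n_rows = len(transcript) + 2    # transcript length plus two extra rows
    m = [[None] * n_cols for _ in range(n_rows)]

    # First row / first column: the characters, starting at the third element.
    for j, ch in enumerate(template):
        m[0][j + 2] = ch
    for i, ch in enumerate(transcript):
        m[i + 2][0] = ch

    # Second row / second column: 0, 1, 2, ... starting at their second element.
    for j in range(1, n_cols):
        m[1][j] = j - 1
    for i in range(1, n_rows):
        m[i][1] = i - 1

    # Fill the remaining elements column by column, top to bottom.
    for j in range(2, n_cols):
        for i in range(2, n_rows):
            if m[0][j] == m[i][0]:
                # Characters equal: take the value of the upper-left element.
                m[i][j] = m[i - 1][j - 1]
            else:
                # Characters differ: minimum of left, upper-left, upper, plus 1.
                m[i][j] = min(m[i][j - 1], m[i - 1][j - 1], m[i - 1][j]) + 1
    return m
```

For the template text `kitten` and the transcribed text `sitting`, the bottom-right element `m[-1][-1]` is 3, which is the usual edit distance between the two strings.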
  • the generating module 14 is configured to record the calculation trajectory of the value of each element in the edit distance matrix and generate a trajectory matrix corresponding to the edit distance matrix.
  • the generating module 14 includes:
  • the first recording module is used to record the calculation trajectory of the value of each element in the edit distance matrix;
  • the first marking module is used to mark the origin of the value of each element in the edit distance matrix according to the calculation trajectory of the value of that element;
  • the first generating module is used to generate the trajectory matrix corresponding to the edit distance matrix after the marking is completed.
  • the first marking module includes:
  • the first sub-marking module is used to mark the origin of the value of an element in the edit distance matrix each time the calculation trajectory of the value of that element is recorded;
  • the first sub-marking completion module is used to complete the marking of the origin of the value of each element in the edit distance matrix.
  • in this case, the origin of the value of an element is marked immediately after the calculation trajectory of that value is recorded; that is, the origins of the element values are marked while the calculation trajectories of the element values are being recorded.
  • alternatively, the first marking module includes:
  • the second sub-marking module is used to mark the origin of the value of each element in the edit distance matrix, according to the calculation trajectory of the value of each element, after the recording of the calculation trajectories of all element values in the edit distance matrix has been completed.
  • in this case, marking of the origins is triggered only once the calculation trajectories of all element values have been recorded, and then continues until the origin of every element value has been marked; that is, until the recording of the calculation trajectories is complete, no origin is marked.
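One way to picture "marking the origin of the value of each element" is a second matrix of the same shape that stores, for every element, which neighbour its value came from. The Python sketch below marks origins while the values are being calculated. It is only an illustration: the function name, the marks `"diag"`, `"left"`, and `"up"`, the marks given to the initialized row and column, and the tie-breaking order (upper-left, then left, then upper) are all assumptions that the text does not specify. For brevity the label row and label column are omitted, so index 0 here corresponds to the initialized second row and second column.

```python
def edit_distance_with_origins(template, transcript):
    """Return (dist, origin); origin[i][j] names the neighbour that produced dist[i][j]."""
    rows, cols = len(transcript) + 1, len(template) + 1
    dist = [[0] * cols for _ in range(rows)]
    origin = [[None] * cols for _ in range(rows)]

    # Initialized row and column (origin marks here are an assumption).
    for j in range(1, cols):
        dist[0][j], origin[0][j] = j, "left"
    for i in range(1, rows):
        dist[i][0], origin[i][0] = i, "up"

    for j in range(1, cols):            # column by column
        for i in range(1, rows):
            if template[j - 1] == transcript[i - 1]:
                # Equal characters: the value comes from the upper-left element.
                dist[i][j], origin[i][j] = dist[i - 1][j - 1], "diag"
            else:
                candidates = [
                    (dist[i - 1][j - 1], "diag"),
                    (dist[i][j - 1], "left"),
                    (dist[i - 1][j], "up"),
                ]
                # min() returns the first smallest candidate, which fixes the tie order.
                value, mark = min(candidates, key=lambda c: c[0])
                dist[i][j], origin[i][j] = value + 1, mark
    return dist, origin
```

Following the marks backwards from any element reconstructs the path by which its value was obtained, which is the kind of information a trajectory matrix needs to hold.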
  • the screening module 15 is configured to calculate the similarity of each trajectory in the trajectory matrix and select the trajectory with the highest similarity between the partially transcribed text and the template text, to obtain a first trajectory.
  • after the trajectory matrix is generated, the similarity of each trajectory in the trajectory matrix is calculated; the trajectory with the highest similarity between the partially transcribed text and the template text is then selected as the first trajectory, which is regarded as the trajectory of the partially transcribed text on the template text.
  • the screening module 15 includes:
  • a third recognition module used to identify, for each trajectory in the trajectory matrix, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text, to obtain the number of equal characters;
  • the first comparison module is used to compare, for each trajectory in the trajectory matrix, the length of the characters of the partially transcribed text with the length of the corresponding characters of the template text, and to select the longer length as the total number of characters;
  • a third calculation module is used to calculate, for each trajectory in the trajectory matrix, the ratio of the number of equal characters to the corresponding total number of characters, to obtain the similarity of each trajectory in the trajectory matrix.
  • after the trajectory matrix is generated, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text is identified for each trajectory in the trajectory matrix, giving the number of equal characters of each trajectory. The length of the characters of the partially transcribed text in each trajectory is then compared with the length of the corresponding characters of the template text, and the longer length is selected as the total number of characters.
  • if the length of the characters of the partially transcribed text in a trajectory is greater than the length of the corresponding characters of the template text, the length of the characters of the partially transcribed text is selected as the total number of characters of that trajectory.
  • if the length of the characters of the partially transcribed text in a trajectory is less than the length of the corresponding characters of the template text, the length of the characters of the template text is selected as the total number of characters of that trajectory.
  • after the longer length has been selected as the total number of characters, the ratio of the number of equal characters of each trajectory to its total number of characters is calculated; this ratio is the similarity of that trajectory.
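The similarity rule above (equal characters divided by the longer of the two lengths) can be sketched in Python. The representation of a trajectory is an assumption made for this example only: each trajectory is a list of `(transcript_char, template_char)` pairs, with `None` standing for a position where one side has no character. The function name `trajectory_similarity` is likewise hypothetical.

```python
def trajectory_similarity(trajectory):
    """Similarity of one trajectory: equal characters / longer character length."""
    # Number of transcript characters equal to their corresponding template characters.
    equal = sum(1 for t, p in trajectory if t is not None and t == p)
    # Character lengths on each side of the trajectory.
    transcript_len = sum(1 for t, _ in trajectory if t is not None)
    template_len = sum(1 for _, p in trajectory if p is not None)
    # The longer length is the total number of characters.
    total = max(transcript_len, template_len)
    return equal / total if total else 0.0


# Selecting the first trajectory: the candidate with the highest similarity.
candidates = [
    [("b", "b"), ("x", "c"), ("d", "d")],   # 2 equal out of 3
    [("b", "a"), ("x", "b"), ("d", "c")],   # 0 equal out of 3
]
first_trajectory = max(candidates, key=trajectory_similarity)
```

With these two hypothetical candidates, `first_trajectory` is the first list, whose similarity is 2/3.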
  • the obtaining module 16 is configured to determine, according to the first trajectory, the start point and the end point of the partially transcribed text on the template text, to obtain the first start point and the first end point.
  • that is, once the first trajectory has been obtained, the start point and the end point corresponding to the partially transcribed text on the template text are determined from it, giving the first start point and the first end point.
  • the obtaining module 16 includes:
  • a second marking module used to mark the first element and the last element in the first trajectory;
  • the first obtaining module is configured to mark the characters of the template text corresponding to the first element and the last element in the first trajectory, to obtain a first start point and a first end point.
  • the second obtaining module 17 is configured to obtain a new template text from the template text according to the first start point and the first end point.
  • specifically, the text of the template text between the first start point and the first end point, including the characters corresponding to the first start point and the first end point, is extracted to obtain the new template text.
  • the second obtaining module 17 includes:
  • an interception module configured to intercept the characters between the first start point and the first end point, wherein the characters between the first start point and the first end point include the character corresponding to the first start point and the character corresponding to the first end point;
  • the second sub-acquisition module is used to generate text from the intercepted characters to obtain the new template text.
  • the characters of the template text between the first start point and the first end point are intercepted, wherein these characters include the character of the template text corresponding to the first start point and the character corresponding to the first end point.
  • text in the same format as the template text is then generated from the intercepted characters to obtain the new template text.
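As a rough illustration of how the first start point and first end point lead to the new template text, the Python sketch below takes the first and last elements of the first trajectory, maps their columns to template characters, and slices the template inclusively. The representation is an assumption: the trajectory is taken to be a list of 0-based `(row, column)` positions in the labelled edit distance matrix, so that column 2 corresponds to the first template character. The function name is hypothetical.

```python
def new_template_from_trajectory(template, first_trajectory):
    """Cut the new template text out of the template, both end points included."""
    if not first_trajectory:
        raise ValueError("the first trajectory must contain at least one element")
    # Columns of the first and last elements; the two label/initialization
    # columns are subtracted to get 0-based template indices.
    start = first_trajectory[0][1] - 2
    end = first_trajectory[-1][1] - 2
    if not 0 <= start <= end < len(template):
        raise ValueError("trajectory does not map onto the template text")
    # end + 1 because the character at the first end point is included.
    return template[start:end + 1]
```

For the template text `abcdefgh` and a trajectory running through columns 4 to 6, the function returns `cde`.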
  • the second calculation module 18 is used to compare the partially transcribed text with the new template text, and calculate the accuracy of the partially transcribed text through an edit distance algorithm.
  • the partially transcribed text is compared with the new template text, not with the full template text.
  • the accuracy of the partially transcribed text is then calculated by the edit distance algorithm. This solves the problem that existing transcription accuracy algorithms compare the transcribed text with the entire template text and therefore cannot accurately calculate the transcription accuracy when only part of the text has been transcribed.
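The effect of comparing against the new template text rather than the full template text can be seen in a small Python example. The accuracy formula used here, 1 minus the edit distance divided by the longer of the two lengths, is an assumption chosen for illustration; the text only states that accuracy is calculated through an edit distance algorithm. The function names and sample strings are also hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])   # equal: upper-left value
            else:
                # differ: minimum of upper-left, upper, left, plus 1
                curr.append(min(prev[j - 1], prev[j], curr[j - 1]) + 1)
        prev = curr
    return prev[-1]


def accuracy(transcript, reference):
    """Assumed accuracy measure: 1 - distance / longer length."""
    longest = max(len(transcript), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(transcript, reference) / longest


template = "thequickbrownfox"      # full template text, punctuation removed
transcript = "quickbrown"          # only the middle part has been transcribed
new_template = "quickbrown"        # template text cut at the first start/end points

against_full = accuracy(transcript, template)
against_new = accuracy(transcript, new_template)
```

Under this assumed formula, `against_full` is 0.625 even though every transcribed character is correct, while `against_new` is 1.0.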
  • a computer device is also provided in an embodiment of the present application.
  • the computer device may be a server, and its internal structure may be as shown in FIG.
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the model of the text accuracy calculation method based on semantic parsing.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer program, when executed by the processor, implements a text accuracy calculation method based on semantic parsing.
  • the above-mentioned processor executes the steps of the above text accuracy calculation method based on semantic parsing: acquiring the partially transcribed text whose transcription starts from any position of the template text other than the starting point; establishing an edit distance matrix whose number of columns is the length of the template text characters plus the length of two characters and whose number of rows is the length of the partially transcribed text characters plus the length of two characters; calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text; recording the calculation trajectory of the value of each element in the edit distance matrix to generate a trajectory matrix corresponding to the edit distance matrix; calculating the similarity of each trajectory in the trajectory matrix and selecting the trajectory with the highest similarity between the partially transcribed text and the template text to obtain a first trajectory; determining, according to the first trajectory, the start point and end point of the partially transcribed text on the template text to obtain the first start point and the first end point; according to the first start point and the first end point, obtain a
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a text accuracy calculation method based on semantic parsing is implemented, specifically: acquiring the partially transcribed text whose transcription starts from any position of the template text other than the starting point; establishing an edit distance matrix whose number of columns is the length of the template text characters plus the length of two characters and whose number of rows is the length of the partially transcribed text characters plus the length of two characters; calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text; recording the calculation trajectory of the value of each element in the edit distance matrix to generate a trajectory matrix corresponding to the edit distance matrix; calculating the similarity of each trajectory in the trajectory matrix and selecting the trajectory with the highest similarity between the partially transcribed text and the template text to obtain the first trajectory; according to the first trajectory, determine the start point and end point of the partially transcribed text on the template text to obtain the first start point and the first

Abstract

The present application relates to a text accuracy calculation method and apparatus based on semantic parsing, and a computer device. The method comprises: when transcription starts from any position of a template text other than the starting point, establishing an edit distance matrix and generating a trajectory matrix from the calculation trajectories of its element values; calculating the similarity of each trajectory in the trajectory matrix; selecting the trajectory having the highest similarity to obtain a first trajectory; obtaining a new template text; then comparing the partially transcribed text with the new template text; and calculating the accuracy of the partially transcribed text.

Description

Text accuracy calculation method, apparatus, and computer device based on semantic parsing
This application claims priority to Chinese patent application No. 2018113472352, entitled "Text accuracy calculation method, apparatus, and computer device based on semantic parsing", filed with the China Patent Office on November 13, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of semantic parsing, and in particular to a text accuracy calculation method, apparatus, and computer device based on semantic parsing.
Background
When measuring the transcription accuracy of an ASR (speech recognition) engine, the commonly used algorithm is the edit distance algorithm. It calculates the similarity between the transcribed text and the template text (the transcription accuracy) by counting the minimum number of edit operations needed to convert the transcribed text into the template text, where an edit operation is replacing one character with another, inserting a character, or deleting a character. However, in scenarios concerned with the real-time transcription accuracy of an ASR engine, the results of this algorithm are unsatisfactory. Because the algorithm always compares the text transcribed so far with the entire template text, it cannot accurately calculate the transcription accuracy of that text when only part of the text has been transcribed. The edit distance algorithm is therefore not suitable for scenarios concerned with the real-time transcription accuracy of an ASR engine.
Technical Problem
In view of the deficiencies of the prior art, the present application proposes a text accuracy calculation method, apparatus, and computer device based on semantic parsing, aiming to solve the problem that existing transcription accuracy algorithms compare the transcribed text with the entire template text and therefore cannot accurately calculate the transcription accuracy when only part of the text has been transcribed.
Technical Solution
The technical solution proposed in this application is as follows:
A text accuracy calculation method based on semantic parsing, the method comprising:
acquiring the partially transcribed text whose transcription starts from any position of the template text other than the starting point;
establishing an edit distance matrix whose number of columns is the length of the template text characters plus the length of two characters and whose number of rows is the length of the partially transcribed text characters plus the length of two characters;
calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix;
calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory with the highest similarity between the partially transcribed text and the template text, to obtain a first trajectory;
determining, according to the first trajectory, the start point and end point of the partially transcribed text on the template text, to obtain a first start point and a first end point;
obtaining a new template text from the template text according to the first start point and the first end point;
comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text through an edit distance algorithm.
The present application also provides a text accuracy calculation apparatus based on semantic parsing, the apparatus comprising:
a first acquisition module, used to acquire the partially transcribed text whose transcription starts from any position of the template text other than the starting point;
an establishing module, used to establish an edit distance matrix whose number of columns is the length of the template text characters plus the length of two characters and whose number of rows is the length of the partially transcribed text characters plus the length of two characters;
a first calculation module, used to calculate the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
a generating module, used to record the calculation trajectory of the value of each element in the edit distance matrix and generate a trajectory matrix corresponding to the edit distance matrix;
a screening module, used to calculate the similarity of each trajectory in the trajectory matrix and select the trajectory with the highest similarity between the partially transcribed text and the template text, to obtain a first trajectory;
an obtaining module, used to determine, according to the first trajectory, the start point and end point of the partially transcribed text on the template text, to obtain a first start point and a first end point;
a second obtaining module, used to obtain a new template text from the template text according to the first start point and the first end point;
a second calculation module, used to compare the partially transcribed text with the new template text and calculate the accuracy of the partially transcribed text through an edit distance algorithm.
The present application also provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer non-volatile readable storage medium on which a computer program is stored, where the computer program implements the steps of any one of the above methods when executed by a processor.
Beneficial Effects
According to the above technical solution, the beneficial effects of the present application are as follows: when transcription starts from any position of the template text other than the starting point, an edit distance matrix is established and the value of each of its elements is calculated; a trajectory matrix is generated from the calculation trajectories of the element values; the similarity of each trajectory in the trajectory matrix is calculated, and the trajectory with the highest similarity is selected as the first trajectory; the start point and end point of the partially transcribed text on the template text are obtained from the first trajectory, giving the new template text; and the partially transcribed text is then compared with the new template text to calculate its accuracy. This solves the problem that existing transcription accuracy algorithms compare the transcribed text with the entire template text and therefore cannot accurately calculate the transcription accuracy when only part of the text has been transcribed.
Brief Description of the Drawings
FIG. 1 is a flowchart of a text accuracy calculation method based on semantic parsing provided by an embodiment of the present application;
FIG. 2 is a functional module diagram of a text accuracy calculation apparatus based on semantic parsing provided by an embodiment of the present application;
FIG. 3 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
Best Mode for Carrying Out the Invention
As shown in FIG. 1, an embodiment of the present application provides a text accuracy calculation method based on semantic parsing. The method includes the following steps:
Step S101: acquire the partially transcribed text whose transcription starts from any position of the template text other than the starting point.
Transcription starts from any position of the template text other than the starting point, and the template text is not transcribed in full; that is, transcription starts from any character of the template text except the first character. If transcription starts from a character other than the first character of the template text, it ends at any character at or after the character at which it started.
Because not all characters of the template text are transcribed, the text obtained by transcription starting from any position of the template text other than the starting point is called the partially transcribed text.
The template text is the correct text, against which the partially transcribed text is compared.
Transcription here refers to converting speech into text through an ASR (speech recognition) engine.
Step S102: establish an edit distance matrix whose number of columns is the length of the template text characters plus the length of two characters and whose number of rows is the length of the partially transcribed text characters plus the length of two characters.
In this embodiment, the template text is text with punctuation marks removed, and the partially transcribed text is likewise text with punctuation marks removed.
The length of the template text characters is obtained and increased by the length of two characters to give the number of columns. The length of the partially transcribed text characters is obtained and increased by the length of two characters to give the number of rows. The edit distance matrix is then established with this number of columns and this number of rows. The two extra columns and two extra rows are added so that the template text and the partially transcribed text can be entered in the first row and the first column respectively, and the initialization values can be entered in the second row and the second column.
Specifically, after step S102 and before step S103, the method includes:
entering the characters of the template text starting from the third element of the first row of the edit distance matrix;
entering the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
defining the value of the second element in the second row of the edit distance matrix as 0;
starting from the value 0 of the second element in the second row of the edit distance matrix and incrementing by 1 in turn, initializing the values of the elements in the second row of the edit distance matrix;
starting from the value 0 of the second element in the second column of the edit distance matrix and incrementing by 1 in turn, initializing the values of the elements in the second column of the edit distance matrix.
The characters of the template text are entered in the first row of the edit distance matrix, starting from the third element of the first row. Correspondingly, the characters of the partially transcribed text are entered in the first column of the edit distance matrix, starting from the third element of the first column. Entering the characters in this way gives every character of the template text and every character of the partially transcribed text a corresponding position in the edit distance matrix, and also provides the positional reference for the initialization values in the second row and the second column. First, the value of the second element in the second row of the edit distance matrix is defined as 0. Then, starting from this value 0 and incrementing by 1 in turn, the values of the elements in the second row are initialized; for example, the values of the second, third, fourth, and fifth elements in the second row are 0, 1, 2, and 3, respectively. Defining the value of the second element in the second row as 0 also defines the value of the second element in the second column as 0, because the two are at the same position, that is, they are the same element. Starting from this value 0 and incrementing by 1 in turn, the values of the elements in the second column are initialized; for example, the values of the second, third, fourth, and fifth elements in the second column are 0, 1, 2, and 3, respectively. After the second row and the second column of the edit distance matrix have been initialized, the values of the remaining elements in the edit distance matrix can be calculated.
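The matrix layout and initialization described above can be sketched in Python as follows. This is a minimal illustration: the function name `build_edit_distance_matrix` and the use of `None` for cells that have no value yet are assumptions made for the example. Indices are 0-based, so the "first row" is `matrix[0]` and the "third element" of a row is index 2.

```python
def build_edit_distance_matrix(template, transcript):
    """Create and initialize the labelled edit distance matrix (illustrative sketch)."""
    n_cols = len(template) + 2      # template length plus two extra columns
    n_rows = len(transcript) + 2    # transcript length plus two extra rows
    matrix = [[None] * n_cols for _ in range(n_rows)]

    # First row: template characters, starting at the third element.
    for j, ch in enumerate(template):
        matrix[0][j + 2] = ch

    # First column: transcript characters, starting at the third element.
    for i, ch in enumerate(transcript):
        matrix[i + 2][0] = ch

    # Second row: 0, 1, 2, ... starting at its second element.
    for j in range(1, n_cols):
        matrix[1][j] = j - 1

    # Second column: 0, 1, 2, ... starting at its second element
    # (matrix[1][1] is shared with the second row and stays 0).
    for i in range(1, n_rows):
        matrix[i][1] = i - 1

    return matrix


for row in build_edit_distance_matrix("abcd", "bc"):
    print(row)
```

For the template text `abcd` and the transcribed text `bc`, the matrix has 4 rows and 6 columns; the elements from `matrix[2][2]` onward are still `None` and are filled in by step S103.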
Step S103: Calculate the value of each element in the edit distance matrix according to the partially transcribed text and the template text.
Whether a character of the partially transcribed text equals the corresponding character of the template text in the edit distance matrix determines how the value of each element is calculated, and the values are then calculated accordingly.
In this embodiment, the value of each uninitialized element of the edit distance matrix is determined by the value of one of its left, upper-left, or upper neighbors. Step S103 includes:
identifying the column number and row number of the third element in the third column of the edit distance matrix;
identifying the character of the template text corresponding to that column number and the character of the partially transcribed text corresponding to that row number;
determining whether the character of the template text corresponding to the column of the third element in the third column is equal to the character of the partially transcribed text corresponding to its row;
if the two characters are equal, setting the value of the third element in the third column to the value of its upper-left element;
if the two characters are not equal, setting the value of the third element in the third column to the minimum of the values of its left, upper-left, and upper elements, plus 1;
calculating the fourth element in the third column in the same way, and so on, until the value of every element in the edit distance matrix has been calculated.
Because the value of each uninitialized element is determined by one of its left, upper-left, or upper neighbors, the only element whose three neighbors all have values when the calculation starts is the third element in the third column. In this embodiment, the calculation therefore begins with that element. Its column number and row number are identified first; the column number gives the corresponding character of the template text, and the row number gives the corresponding character of the partially transcribed text. The two characters are then compared, and the result determines the element's value: if they are equal, the value of the third element in the third column is the value of its upper-left element; if they are not equal, its value is the minimum of the values of its left, upper-left, and upper elements, plus 1. After the third element in the third column has been calculated, the fourth element in the third column is calculated, and so on down the column. Once every element in the third column has been calculated, the elements of the fourth column are calculated, and the process continues column by column until the last column is finished, at which point the value of every element in the edit distance matrix has been calculated.
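The fill rule of step S103 can be illustrated with a short sketch. It is written under assumptions: the label row and label column are omitted, so `d[i][j]` (0-based) stands for the element in row i + 2 and column j + 2 of the matrix described above, and the function name `fill_edit_distance` is hypothetical.

```python
def fill_edit_distance(template, partial):
    """Fill the numeric part of the edit distance matrix column by column."""
    rows, cols = len(partial) + 1, len(template) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):          # initialized row: 0, 1, 2, ...
        d[0][j] = j
    for i in range(rows):          # initialized column: 0, 1, 2, ...
        d[i][0] = i

    for j in range(1, cols):       # one column at a time, left to right
        for i in range(1, rows):   # top to bottom within the column
            if partial[i - 1] == template[j - 1]:
                # Equal characters: copy the upper-left value.
                d[i][j] = d[i - 1][j - 1]
            else:
                # Unequal: minimum of left, upper-left, upper, plus 1.
                d[i][j] = min(d[i][j - 1], d[i - 1][j - 1], d[i - 1][j]) + 1
    return d
```

The bottom-right value is the edit distance between the two strings; copying the upper-left value on a match gives the same result as the usual Levenshtein recurrence, because the upper-left value never exceeds either neighbor plus 1.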
Step S104: Record the calculation trajectory of the value of each element in the edit distance matrix, and generate a trajectory matrix corresponding to the edit distance matrix.
While the values of the elements of the edit distance matrix are being calculated, the calculation trajectory of each value is recorded, that is, which neighboring element's value determined it. When all values have been calculated, the recording is also complete, and a trajectory matrix corresponding to the edit distance matrix is generated.
In this embodiment, step S104 includes:
recording the calculation trajectory of the value of each element in the edit distance matrix;
marking the origin of the value of each element in the edit distance matrix according to its calculation trajectory;
after the marking is complete, generating the trajectory matrix corresponding to the edit distance matrix.
The calculation trajectory of each element's value is recorded, and the origin of each value is marked according to that trajectory. In this embodiment, lt indicates that an element was calculated from its upper-left element, l indicates that it was calculated from its left element, and t indicates that it was calculated from its upper element. For example, if the third element in the third column was determined by the second element in the second column, lt is entered at the third element in the third column; if it was determined by the third element in the second column, l is entered; and if it was determined by the second element in the third column, t is entered. This marks the origin of the third element in the third column. After all elements have been marked, the trajectory matrix corresponding to the edit distance matrix is generated.
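A minimal sketch of recording the lt, l, and t marks while the values are calculated is given below. Several points are assumptions that the text does not specify: the label row and column are omitted, initialized cells are left as empty strings, and ties between equally small neighbors are resolved in the order upper-left, left, upper. The function name `build_trace_matrix` is hypothetical.

```python
def build_trace_matrix(template, partial):
    """Return (d, trace): edit distance values and their lt / l / t origins."""
    rows, cols = len(partial) + 1, len(template) + 1
    d = [[0] * cols for _ in range(rows)]
    trace = [[""] * cols for _ in range(rows)]  # "" for initialized cells (assumption)
    for j in range(cols):
        d[0][j] = j
    for i in range(rows):
        d[i][0] = i

    for j in range(1, cols):
        for i in range(1, rows):
            if partial[i - 1] == template[j - 1]:
                d[i][j], trace[i][j] = d[i - 1][j - 1], "lt"
            else:
                # Tie-break order lt, l, t is an assumption; min() keeps the
                # first of several equal candidates.
                candidates = [
                    (d[i - 1][j - 1], "lt"),
                    (d[i][j - 1], "l"),
                    (d[i - 1][j], "t"),
                ]
                best, mark = min(candidates, key=lambda c: c[0])
                d[i][j], trace[i][j] = best + 1, mark
    return d, trace
```

This version marks each origin at the moment the value is calculated; deferring the marking until all values are known would only change when `trace` is filled, not its contents.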
In this embodiment, the step of marking the origin of the value of each element in the edit distance matrix according to its calculation trajectory includes:
each time the calculation trajectory of one element's value is recorded, marking the origin of that element's value;
continuing until the origin of every element's value in the edit distance matrix has been marked.
That is, the origin of an element's value is marked as soon as its calculation trajectory is recorded, so recording and marking proceed together.
In some embodiments, the step of marking the origin of the value of each element in the edit distance matrix according to its calculation trajectory includes:
after the calculation trajectories of all element values in the edit distance matrix have been recorded, marking the origin of each element's value according to its calculation trajectory.
Marking is triggered only after the calculation trajectories of all element values have been recorded, and then continues until every origin has been marked. In other words, no origin is marked while the recording of calculation trajectories is still incomplete.
Step S105: Calculate the similarity of each trajectory in the trajectory matrix, and select the trajectory along which the partially transcribed text is most similar to the template text, to obtain a first trajectory.
After the trajectory matrix is generated, the similarity of each trajectory in it is calculated. The trajectory with the highest similarity between the partially transcribed text and the template text is then selected as the first trajectory, which is taken to be the trajectory of the partially transcribed text on the template text.
In this embodiment, the step of calculating the similarity of each trajectory in the trajectory matrix includes:
for each trajectory in the trajectory matrix, counting the characters of the partially transcribed text that are equal to the corresponding characters of the template text, to obtain the number of equal characters;
for each trajectory, comparing the length of the characters of the partially transcribed text with the length of the corresponding characters of the template text, and taking the longer of the two as the total number of characters;
for each trajectory, calculating the ratio of the number of equal characters to the total number of characters, to obtain the similarity of that trajectory.
After the trajectory matrix is generated, the number of characters of the partially transcribed text that equal the corresponding characters of the template text is counted for each trajectory, giving the number of equal characters. Next, for each trajectory, the length of the characters of the partially transcribed text is compared with the length of the corresponding characters of the template text, and the longer one is taken as the total number of characters: if the partially transcribed text is longer along a trajectory, its length is the total number of characters; if it is shorter, the length of the corresponding template text is the total number of characters. Finally, the ratio of the number of equal characters to the total number of characters is calculated for each trajectory, which gives the similarity of each trajectory in the trajectory matrix.
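The similarity ratio can be illustrated as follows. The text does not define how a trajectory is stored, so this sketch assumes a hypothetical representation: a list of aligned pairs `(partial_char, template_char)`, where `None` stands for a position that has no character on that side. The function name `trajectory_similarity` is likewise hypothetical.

```python
def trajectory_similarity(pairs):
    """Equal-character count divided by the longer of the two aligned lengths."""
    # Characters of the partially transcribed text that equal their template counterpart.
    equal = sum(1 for p, t in pairs if p is not None and p == t)
    # Lengths of the two sides along this trajectory.
    partial_len = sum(1 for p, _ in pairs if p is not None)
    template_len = sum(1 for _, t in pairs if t is not None)
    # The longer side is taken as the total number of characters.
    total = max(partial_len, template_len)
    return equal / total if total else 0.0
```

For example, a trajectory that aligns "bc" against "bcd" has two equal characters and a total of three, so its similarity is 2/3; the trajectory with the largest ratio would be chosen as the first trajectory.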
Step S106: According to the first trajectory, determine the start point and end point on the template text that correspond to the partially transcribed text, to obtain a first start point and a first end point.
Because the first trajectory has a start point and an end point in the trajectory matrix, the start point and end point of the partially transcribed text on the template text can be determined from it, which gives the first start point and the first end point.
In this embodiment, step S106 includes:
marking the first element and the last element of the first trajectory;
marking the characters of the template text that correspond to the first element and the last element of the first trajectory, to obtain the first start point and the first end point.
After the first trajectory is obtained, its first element is marked, the character of the template text in the column of that element is found and marked, and this gives the first start point.
Likewise, the last element of the first trajectory is marked, the character of the template text in the column of that element is found and marked, and this gives the first end point.
Step S107: Obtain a new template text from the template text according to the first start point and the first end point.
After the first start point and the first end point are obtained, the text between these two points of the template text, including the characters at the first start point and the first end point themselves, is taken as the new template text.
In this embodiment, step S107 includes:
extracting the characters between the first start point and the first end point, where these characters include the character corresponding to the first start point and the character corresponding to the first end point;
generating text from the extracted characters to obtain the new template text.
After the first start point and the first end point are obtained, the characters of the template text between them are extracted, including the characters at both points. Text in the same format as the template text is then generated from the extracted characters, which gives the new template text.
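Steps S106 and S107 together amount to an inclusive slice of the template text. The sketch below assumes, hypothetically, that the first trajectory is a list of `(row, column)` cells in a numeric matrix without label row and column, so that column j (with j at least 1) corresponds to `template[j - 1]`; the function name is also an assumption.

```python
def new_template_from_trajectory(template, trajectory):
    """Cut the template between the columns of the first and last trajectory cells."""
    first_col = trajectory[0][1]    # column of the first element of the trajectory
    last_col = trajectory[-1][1]    # column of the last element of the trajectory
    start = first_col - 1           # 0-based index of the first start point
    end = last_col - 1              # 0-based index of the first end point
    # Both end points are included in the new template text.
    return template[start:end + 1]
```

For the template "abcd" and a trajectory running through cells (1, 2) and (2, 3), the start point is "b", the end point is "c", and the new template text is "bc".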
Step S108: Compare the partially transcribed text with the new template text, and calculate the accuracy of the partially transcribed text using an edit distance algorithm.
After the new template text is obtained, the partially transcribed text is compared with the new template text rather than with the full template text, and its accuracy is calculated using an edit distance algorithm. This addresses a problem of existing transcription accuracy algorithms: they compare the text transcribed so far with the entire template text, so they cannot calculate the transcription accuracy correctly when only part of the text has been transcribed.
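This passage does not give the accuracy formula itself. The sketch below therefore uses a common convention as an explicit assumption: accuracy is 1 minus the edit distance divided by the length of the new template text, floored at 0. Both function names are hypothetical.

```python
def levenshtein(a, b):
    """Standard edit distance between strings a and b, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def transcription_accuracy(partial, new_template):
    """Assumed formula: 1 - distance / len(new_template), never below 0."""
    if not new_template:
        return 0.0
    distance = levenshtein(partial, new_template)
    return max(0.0, 1.0 - distance / len(new_template))
```

Comparing against the new template text rather than the full template keeps untranscribed portions of the template from being counted as errors, which is the point made in step S108.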
In summary, when transcription starts at any position of the template text other than its start point, an edit distance matrix is established and the value of each of its elements is calculated; a trajectory matrix is generated from the calculation trajectories of those values; the similarity of each trajectory in the trajectory matrix is calculated, and the trajectory with the highest similarity is selected as the first trajectory; the start point and end point of the partially transcribed text on the template text are obtained from the first trajectory, which yields the new template text; and the partially transcribed text is compared with the new template text to calculate its accuracy. This is intended to solve the problem that existing transcription accuracy algorithms compare the text transcribed so far with the entire template text and therefore cannot calculate the transcription accuracy correctly when only part of the text has been transcribed.
As shown in FIG. 2, an embodiment of the present application provides a text accuracy calculation apparatus 1 based on semantic parsing. The apparatus 1 includes a first acquisition module 11, an establishment module 12, a first calculation module 13, a generation module 14, a screening module 15, an obtaining module 16, a second acquisition module 17, and a second calculation module 18.
The first acquisition module 11 is configured to acquire partially transcribed text whose transcription starts at any position of the template text other than the start point.
Transcription starts at any position of the template text other than the start point, and the template text is not transcribed in full; that is, transcription may start from any character of the template text except the first. When transcription starts from a character other than the first, it may end at the character where it started or at any later character of the template text.
Because not all characters of the template text are transcribed, the text obtained by transcribing from a position other than the start point is called partially transcribed text.
The template text is the correct text against which the partially transcribed text is compared.
Transcription here refers to converting speech into text by means of an ASR (speech recognition) engine.
The establishment module 12 is configured to establish an edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two.
In this embodiment, the template text is plain text with punctuation removed, and so is the partially transcribed text.
The character length of the template text is obtained and increased by two to give the number of columns; the character length of the partially transcribed text is obtained and increased by two to give the number of rows; and the edit distance matrix is established with these dimensions. The two extra columns and rows are needed so that the template text and the partially transcribed text can be entered in the first row and first column, respectively, and the initialized values can be entered in the second row and second column.
Specifically, the apparatus 1 includes:
a first input module, configured to enter the characters of the template text starting from the third element of the first row of the edit distance matrix;
a second input module, configured to enter the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
a definition module, configured to define the value of the second element in the second row of the edit distance matrix as 0;
a first initialization module, configured to initialize the values of the elements of the second row of the edit distance matrix by starting from the 0 at the second element of the second row and incrementing by 1 each time;
a second initialization module, configured to initialize the values of the elements of the second column of the edit distance matrix by starting from the 0 at the second element of the second column and incrementing by 1 each time.
The characters of the template text are entered in the first row of the edit distance matrix, starting from the third element of that row. Correspondingly, the characters of the partially transcribed text are entered in the first column, starting from the third element of that column. Starting both sequences at the third element gives every character of the template text and every character of the partially transcribed text a corresponding position in the edit distance matrix, and also aligns them with the initialized values in the second row and second column. First, the value of the second element in the second row is defined as 0. Then, starting from that 0 and incrementing by 1 each time, the values of the elements of the second row are initialized; for example, the second, third, fourth, and fifth elements of the second row are 0, 1, 2, and 3, respectively. Defining the second element of the second row as 0 also defines the second element of the second column as 0, because the two occupy the same position, that is, they are the same element. Starting from that 0 and incrementing by 1 each time, the values of the elements of the second column are initialized in the same way; for example, the second, third, fourth, and fifth elements of the second column are 0, 1, 2, and 3, respectively. Once the second row and second column have been initialized, the values of the remaining elements of the edit distance matrix can be calculated.
The first calculation module 13 is configured to calculate the value of each element in the edit distance matrix according to the partially transcribed text and the template text.
Whether a character of the partially transcribed text equals the corresponding character of the template text in the edit distance matrix determines how the value of each element is calculated, and the values are then calculated accordingly.
In this embodiment, the value of each uninitialized element of the edit distance matrix is determined by the value of one of its left, upper-left, or upper neighbors. The first calculation module 13 includes:
a first identification module, configured to identify the column number and row number of the third element in the third column of the edit distance matrix;
a second identification module, configured to identify the character of the template text corresponding to that column number and the character of the partially transcribed text corresponding to that row number;
a first judgment module, configured to determine whether the character of the template text corresponding to the column of the third element in the third column is equal to the character of the partially transcribed text corresponding to its row; if the two characters are equal, the value of the third element in the third column is the value of its upper-left element; if the two characters are not equal, the value of the third element in the third column is the minimum of the values of its left, upper-left, and upper elements, plus 1;
a first sub-calculation module, configured to calculate the fourth element in the third column in the same way, and so on, until the value of every element in the edit distance matrix has been calculated.
Because the value of each uninitialized element in the edit distance matrix is determined by the value of one of the elements to its left, at its upper left, or above it, at the start of the calculation the only element whose left, upper-left, and upper neighbours all hold values is the third element in the third column of the edit distance matrix. In this embodiment, that element is therefore calculated first: its column number and row number are identified, and then the character of the template text corresponding to its column and the character of the partially transcribed text corresponding to its row are identified.
After the corresponding characters are obtained, it is judged whether the character of the template text and the character of the partially transcribed text are equal, and the result determines the value of the third element in the third column of the edit distance matrix. If the two characters are equal, the value of the element is the value of the element at its upper left. If the two characters are not equal, the value of the element is the minimum of the values of the elements to its left, at its upper left, and above it, plus 1.
After the value of the third element in the third column of the edit distance matrix is calculated, the value of the fourth element in the third column is calculated next, and so on. Once the values of all elements in the third column have been calculated, the values of the elements in the fourth column are calculated, and the process continues until the values of the elements in the last column have been calculated, at which point the calculation of the values of all elements in the edit distance matrix is complete.
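The column-by-column fill rule can be sketched in Python as follows. For brevity this sketch drops the label row and label column, so its row 0 and column 0 correspond to the initialized second row and second column of the labelled matrix. The function name `fill_edit_distance` is an illustrative assumption.

```python
def fill_edit_distance(template: str, partial: str) -> list:
    """Fill an edit distance matrix column by column.

    Rows follow the partially transcribed text, columns follow the
    template text. Row 0 and column 0 hold the initial values 0, 1, 2, ...
    """
    rows, cols = len(partial) + 1, len(template) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j
    for i in range(rows):
        d[i][0] = i

    for j in range(1, cols):        # one column at a time
        for i in range(1, rows):    # top to bottom within the column
            if partial[i - 1] == template[j - 1]:
                # Equal characters: copy the upper-left value.
                d[i][j] = d[i - 1][j - 1]
            else:
                # Unequal characters: minimum of left, upper left, above, plus 1.
                d[i][j] = min(d[i][j - 1], d[i - 1][j - 1], d[i - 1][j]) + 1
    return d
```

The bottom-right cell of the returned matrix is the edit distance between the two full strings; for example, `fill_edit_distance("kitten", "sitting")[-1][-1]` evaluates to 3.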
The generating module 14 is configured to record the calculation trajectory of the value of each element in the edit distance matrix and to generate a trajectory matrix corresponding to the edit distance matrix.
While the values of the elements in the edit distance matrix are being calculated, the calculation trajectory of each value is recorded, that is, which element's value determined the value of each element. Once all values in the edit distance matrix have been calculated, the recording of the calculation trajectories is also complete, and a trajectory matrix corresponding to the edit distance matrix is generated.
In this embodiment, the generating module 14 includes:
A first recording module, configured to record the calculation trajectory of the value of each element in the edit distance matrix;
A first marking module, configured to mark the origin of the value of each element in the edit distance matrix according to the calculation trajectory of the value of each element in the edit distance matrix;
A first generating module, configured to generate, after the marking is completed, a trajectory matrix corresponding to the edit distance matrix.
The calculation trajectory of the value of each element in the edit distance matrix is recorded, and the origin of each value is marked according to that trajectory. In this embodiment, lt indicates that the element was calculated from the element at its upper left, l indicates that it was calculated from the element to its left, and t indicates that it was calculated from the element above it. For example, if the third element in the third column of the edit distance matrix is determined by the second element in the second column, lt is entered at the third element in the third column; if it is determined by the third element in the second column, l is entered; and if it is determined by the second element in the third column, t is entered, thereby marking the origin of the third element in the third column of the edit distance matrix. After the marking is completed, a trajectory matrix corresponding to the edit distance matrix is generated.
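A minimal Python sketch of recording the lt, l, and t marks while the values are calculated is given below. The description does not state how ties between equally small neighbours are resolved; this sketch assumes the order lt, then l, then t, and it leaves the initialized border cells unmarked. The function name `build_matrices` is an illustrative assumption.

```python
def build_matrices(template: str, partial: str):
    """Return (dist, trace): edit distance values and their origin marks.

    trace[i][j] is "lt", "l", or "t" for calculated cells and "" for the
    initialized border. Tie-breaking order (lt, l, t) is an assumption.
    """
    rows, cols = len(partial) + 1, len(template) + 1
    dist = [[0] * cols for _ in range(rows)]
    trace = [[""] * cols for _ in range(rows)]
    for j in range(cols):
        dist[0][j] = j
    for i in range(rows):
        dist[i][0] = i

    for j in range(1, cols):
        for i in range(1, rows):
            if partial[i - 1] == template[j - 1]:
                dist[i][j], trace[i][j] = dist[i - 1][j - 1], "lt"
            else:
                candidates = [
                    (dist[i - 1][j - 1], "lt"),  # upper left
                    (dist[i][j - 1], "l"),       # left
                    (dist[i - 1][j], "t"),       # above
                ]
                # min() returns the first smallest candidate, giving lt > l > t.
                best, mark = min(candidates, key=lambda c: c[0])
                dist[i][j], trace[i][j] = best + 1, mark
    return dist, trace
```

Marking each cell in the same loop that computes its value corresponds to the variant in which recording and marking proceed together.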
In this embodiment, the first marking module includes:
A first sub-marking module, configured to mark the origin of the value of an element in the edit distance matrix each time the calculation trajectory of the value of that element is recorded;
A first sub-marking completion module, configured to continue until the origins of the values of all elements in the edit distance matrix have been marked.
Each time the calculation trajectory of the value of an element in the edit distance matrix is recorded, the origin of that element's value is marked immediately; that is, the calculation trajectories are recorded and the origins are marked at the same time.
In some embodiments, the first marking module includes:
A second sub-marking module, configured to mark the origin of the value of each element in the edit distance matrix according to the calculation trajectories, after the recording of the calculation trajectories of the values of all elements in the edit distance matrix is complete.
The marking of the origin of each element's value is triggered only after the calculation trajectories of the values of all elements in the edit distance matrix have been recorded, and it continues until the origins of all values have been marked. That is, while the recording of the calculation trajectories is incomplete, no marking is performed.
The screening module 15 is configured to calculate the similarity of each trajectory in the trajectory matrix and to select the trajectory with the highest similarity between the partially transcribed text and the template text, obtaining a first trajectory.
After the trajectory matrix is generated, the similarity of each trajectory in the trajectory matrix is calculated. Once the similarities have been calculated, the trajectory with the highest similarity between the partially transcribed text and the template text is selected as the first trajectory, which is regarded as the trajectory of the partially transcribed text on the template text.
In this embodiment, the screening module 15 includes:
A third identification module, configured to identify, for each trajectory in the trajectory matrix, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text, obtaining the number of equal characters;
A first comparison module, configured to compare, for each trajectory in the trajectory matrix, the character length of the partially transcribed text with the character length of the corresponding template text, and to select the longer of the two as the total number of characters;
A third calculation module, configured to calculate, for each trajectory in the trajectory matrix, the ratio of the number of equal characters to the corresponding total number of characters, obtaining the similarity of each trajectory in the trajectory matrix.
After the trajectory matrix is generated, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text is identified for each trajectory, obtaining the number of equal characters. Then, for each trajectory, the character length of the partially transcribed text is compared with the character length of the corresponding template text, and the longer of the two is selected as the total number of characters. If, in a given trajectory, the character length of the partially transcribed text is greater than that of the corresponding template text, the character length of the partially transcribed text is taken as the total number of characters. If it is less, the character length of the template text is taken as the total number of characters. After the total number of characters is selected, the ratio of the number of equal characters to the corresponding total number of characters is calculated for each trajectory, which gives the similarity of each trajectory in the trajectory matrix.
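The similarity ratio can be illustrated with a short Python sketch. How a trajectory is stored is not specified in the description; this sketch assumes a trajectory has already been expanded into aligned pairs of (partial-text character, template character), with `None` marking a position where one side has no character. The names `Pair` and `track_similarity` are illustrative assumptions.

```python
from typing import Optional

# One aligned position: (partial-text char, template char); None marks a gap.
Pair = tuple[Optional[str], Optional[str]]


def track_similarity(track: list[Pair]) -> float:
    """Similarity = equal characters / max(partial length, template length)."""
    equal = sum(1 for p, t in track if p is not None and p == t)
    partial_len = sum(1 for p, _ in track if p is not None)
    template_len = sum(1 for _, t in track if t is not None)
    total = max(partial_len, template_len)  # the longer side is the total
    return equal / total if total else 0.0
```

For instance, a trajectory with two equal characters, a partial-text length of 3, and a template length of 4 yields 2 / 4 = 0.5.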
The obtaining module 16 is configured to determine, according to the first trajectory, the start point and end point of the partially transcribed text on the template text, obtaining a first start point and a first end point.
After the first trajectory is obtained, because the first trajectory has a start point and an end point in the trajectory matrix, the start point and end point of the partially transcribed text on the template text are determined from it, yielding the first start point and the first end point.
In this embodiment, the obtaining module 16 includes:
A second marking module, configured to mark the first element and the last element of the first trajectory;
A first obtaining module, configured to mark the characters of the template text corresponding to the first element and the last element of the first trajectory, obtaining the first start point and the first end point.
After the first trajectory is obtained, the first element of the first trajectory is marked; the character of the template text in the column of that first element is obtained and marked, which gives the first start point.
Likewise, the last element of the first trajectory is marked; the character of the template text in the column of that last element is obtained and marked, which gives the first end point.
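The mapping from the first and last elements of the first trajectory to positions in the template text can be sketched as follows. The sketch assumes a trajectory is stored as a list of (row, column) cells using 0-based indices into the labelled matrix, where the template characters begin at the third column (index 2). The function name `track_endpoints` and the `label_offset` parameter are illustrative assumptions.

```python
def track_endpoints(track_cells: list, label_offset: int = 2) -> tuple:
    """Return (first_start, first_end) as 0-based indices into the template.

    track_cells is a list of (row, col) cells of the labelled matrix;
    label_offset is the column index of the first template character.
    """
    if not track_cells:
        raise ValueError("trajectory must contain at least one cell")
    first_col = track_cells[0][1]   # column of the first element
    last_col = track_cells[-1][1]   # column of the last element
    return first_col - label_offset, last_col - label_offset
```

Only the column of each endpoint is used, because the template characters label the columns of the matrix.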
The second acquisition module 17 is configured to acquire a new template text from the template text according to the first start point and the first end point.
After the first start point and the first end point are obtained, the text between the two points in the template text is taken, including the characters corresponding to the first start point and the first end point themselves, so that the new template text is acquired from the template text.
In this embodiment, the second acquisition module 17 includes:
An interception module, configured to intercept the characters between the first start point and the first end point, where the characters between the first start point and the first end point include the character corresponding to the first start point and the character corresponding to the first end point;
A second sub-acquisition module, configured to generate text from the intercepted characters and acquire the new template text.
After the first start point and the first end point are obtained, the characters of the template text between the first start point and the first end point are intercepted, including the character corresponding to the first start point and the character corresponding to the first end point. After the characters are intercepted, text in the same format as the template text is generated from them, which gives the new template text.
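The inclusive interception can be illustrated with a small Python sketch, assuming the first start point and first end point are 0-based character indices into the template text. The function name `extract_new_template` is an illustrative assumption.

```python
def extract_new_template(template: str, first_start: int, first_end: int) -> str:
    """Return the template characters from first_start to first_end, inclusive."""
    if not 0 <= first_start <= first_end < len(template):
        raise ValueError("start and end must satisfy 0 <= start <= end < len(template)")
    # Python slices exclude the upper bound, so add 1 to keep the end character.
    return template[first_start:first_end + 1]
```

For example, `extract_new_template("the quick brown fox", 4, 8)` returns `"quick"`.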
The second calculation module 18 is configured to compare the partially transcribed text with the new template text and to calculate the accuracy of the partially transcribed text by an edit distance algorithm.
After the new template text is obtained, the partially transcribed text is compared with the new template text rather than with the full template text, and the accuracy of the partially transcribed text is calculated by the edit distance algorithm. This solves the problem that existing transcription accuracy algorithms compare the text transcribed so far with the entire template text and therefore cannot accurately calculate the transcription accuracy when only part of the text has been transcribed.
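A minimal Python sketch of this final comparison is shown below. This passage does not give the formula that converts an edit distance into an accuracy; the sketch assumes accuracy = 1 - distance / length of the new template text, clamped at 0, which is one common convention and not necessarily the one intended by the application. The function names `levenshtein` and `accuracy` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein edit distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(partial: str, new_template: str) -> float:
    """Assumed formula: 1 - distance / len(new_template), clamped to [0, 1]."""
    if not new_template:
        return 1.0 if not partial else 0.0
    distance = levenshtein(partial, new_template)
    return max(0.0, 1.0 - distance / len(new_template))
```

Under this assumed formula, a partially transcribed text of "quack" compared with a new template text of "quick" has one substitution in five characters, giving an accuracy of about 0.8, whereas comparing it with a much longer full template text would give a far lower value.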
As shown in FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data such as the model of the text accuracy calculation method based on semantic parsing. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a text accuracy calculation method based on semantic parsing.
The processor executes the steps of the above text accuracy calculation method based on semantic parsing: acquiring a partially transcribed text whose transcription starts from any position of the template text other than its starting point; establishing an edit distance matrix with the character length of the template text plus two as the number of columns and the character length of the partially transcribed text plus two as the number of rows; calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text; recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix; calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory with the highest similarity between the partially transcribed text and the template text to obtain a first trajectory; determining, according to the first trajectory, the start point and end point of the partially transcribed text on the template text to obtain a first start point and a first end point; acquiring a new template text from the template text according to the first start point and the first end point; and comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text by an edit distance algorithm.
Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied.
An embodiment of the present application further provides a non-volatile computer-readable storage medium storing a computer program which, when executed by a processor, implements a text accuracy calculation method based on semantic parsing, specifically: acquiring a partially transcribed text whose transcription starts from any position of the template text other than its starting point; establishing an edit distance matrix with the character length of the template text plus two as the number of columns and the character length of the partially transcribed text plus two as the number of rows; calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text; recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix; calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory with the highest similarity between the partially transcribed text and the template text to obtain a first trajectory; determining, according to the first trajectory, the start point and end point of the partially transcribed text on the template text to obtain a first start point and a first end point; acquiring a new template text from the template text according to the first start point and the first end point; and comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text by an edit distance algorithm.

Claims (20)

  1. A text accuracy calculation method based on semantic parsing, characterized in that the method includes:
    acquiring a partially transcribed text whose transcription starts from any position of the template text other than its starting point;
    establishing an edit distance matrix with the character length of the template text plus two as the number of columns and the character length of the partially transcribed text plus two as the number of rows;
    calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
    recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix;
    calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory with the highest similarity between the partially transcribed text and the template text to obtain a first trajectory;
    determining, according to the first trajectory, the start point and end point of the partially transcribed text on the template text to obtain a first start point and a first end point;
    acquiring a new template text from the template text according to the first start point and the first end point;
    comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text by an edit distance algorithm.
  2. The text accuracy calculation method based on semantic parsing according to claim 1, characterized in that, after the step of establishing the edit distance matrix with the character length of the template text plus two as the number of columns and the character length of the partially transcribed text plus two as the number of rows, and before the step of calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text, the method includes:
    entering the characters of the template text starting from the third element of the first row of the edit distance matrix;
    entering the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
    defining the value of the second element in the second row of the edit distance matrix as 0;
    initializing the values of the elements in the second row of the edit distance matrix by starting from the value 0 of the second element in the second row and incrementing by 1 for each subsequent element;
    initializing the values of the elements in the second column of the edit distance matrix by starting from the value 0 of the second element in the second column and incrementing by 1 for each subsequent element.
  3. The semantic-parsing-based text accuracy calculation method according to claim 2, wherein the value of each uninitialized element of the edit distance matrix is determined by the value of one of the elements to its left, at its upper left, or above it, and the step of calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text comprises:
    identifying the column number and the row number of the third element in the third column of the edit distance matrix;
    identifying the character of the template text corresponding to that column number and the character of the partially transcribed text corresponding to that row number;
    determining whether the character of the template text corresponding to the column of the third element in the third column of the edit distance matrix is equal to the character of the partially transcribed text corresponding to the row of that element;
    if the two characters are equal, taking the value of the element at the upper left of the third element in the third column of the edit distance matrix as the value of that element;
    if the two characters are not equal, obtaining the value of the third element in the third column of the edit distance matrix by adding 1 to the minimum of the values of the elements to its left, at its upper left, and above it;
    calculating in turn the value of the fourth element in the third column of the edit distance matrix, and so on, until the values of all elements in the edit distance matrix have been calculated.
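The fill rule of this claim can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `fill_edit_distance` is hypothetical, and the label row and label column that hold the characters themselves are assumed to be left out, so the initialized row and column become index 0.

```python
def fill_edit_distance(template, partial):
    """Fill the numeric part of an edit distance matrix.

    Rows follow the partially transcribed text and columns follow the
    template text. The label row and label column are omitted (an
    assumption for brevity), so cell [i][j] pairs partial[i - 1] with
    template[j - 1].
    """
    rows, cols = len(partial) + 1, len(template) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Initialized row and column: 0, 1, 2, ...
    for j in range(cols):
        dist[0][j] = j
    for i in range(rows):
        dist[i][0] = i
    # Walk column by column, top to bottom, as the claim describes.
    for j in range(1, cols):
        for i in range(1, rows):
            if template[j - 1] == partial[i - 1]:
                # Equal characters: copy the upper-left neighbour.
                dist[i][j] = dist[i - 1][j - 1]
            else:
                # Unequal: minimum of left, upper-left, above, plus 1.
                dist[i][j] = 1 + min(dist[i][j - 1],
                                     dist[i - 1][j - 1],
                                     dist[i - 1][j])
    return dist
```

With this rule the bottom-right cell holds the ordinary Levenshtein distance between the two strings.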
  4. The semantic-parsing-based text accuracy calculation method according to claim 1, wherein the step of recording the calculation trajectory of the value of each element in the edit distance matrix and generating a trajectory matrix corresponding to the edit distance matrix comprises:
    recording the calculation trajectory of the value of each element in the edit distance matrix;
    marking, according to the calculation trajectory of the value of each element in the edit distance matrix, the origin of the value of each element in the edit distance matrix;
    after the marking is completed, generating the trajectory matrix corresponding to the edit distance matrix.
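One way to picture the trajectory matrix is as a second matrix of the same shape that stores, for every cell, which neighbour its value came from. The sketch below is illustrative only: the labels `"diag"`, `"left"`, and `"up"`, the function name, and the tie-breaking order (upper left first, then left, then above) are assumptions, since the claim does not specify them.

```python
def edit_distance_with_trace(template, partial):
    """Return (dist, trace); trace[i][j] names the origin of dist[i][j]."""
    rows, cols = len(partial) + 1, len(template) + 1
    dist = [[0] * cols for _ in range(rows)]
    trace = [[None] * cols for _ in range(rows)]
    for j in range(1, cols):
        dist[0][j] = j
        trace[0][j] = "left"   # initialized row grows from the left
    for i in range(1, rows):
        dist[i][0] = i
        trace[i][0] = "up"     # initialized column grows from above
    for i in range(1, rows):
        for j in range(1, cols):
            if partial[i - 1] == template[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
                trace[i][j] = "diag"
            else:
                candidates = [
                    (dist[i - 1][j - 1], "diag"),
                    (dist[i][j - 1], "left"),
                    (dist[i - 1][j], "up"),
                ]
                # min() keeps the first candidate on ties (assumed order).
                best, origin = min(candidates, key=lambda c: c[0])
                dist[i][j] = best + 1
                trace[i][j] = origin
    return dist, trace
```

Following the marks backwards from any cell reconstructs the path by which that cell's value was produced.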
  5. The semantic-parsing-based text accuracy calculation method according to claim 1, wherein the step of calculating the similarity of each trajectory in the trajectory matrix comprises:
    identifying, for each trajectory in the trajectory matrix, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text, to obtain the number of equal characters;
    comparing, for each trajectory in the trajectory matrix, the character length of the partially transcribed text with the character length of the corresponding template text, and selecting the longer of the two as the total number of characters;
    calculating, for each trajectory in the trajectory matrix, the ratio of the number of equal characters to the corresponding total number of characters, to obtain the similarity of each trajectory in the trajectory matrix.
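The similarity ratio described in this claim can be sketched as follows. The representation of a trajectory as a list of aligned `(partial_char, template_char)` pairs, with `None` marking a gap, is an assumption made for illustration; the claim does not fix a data structure.

```python
def track_similarity(aligned_pairs):
    """Similarity of one trajectory: equal characters / longer length.

    aligned_pairs is a hypothetical representation: a list of
    (partial_char, template_char) tuples, where None marks a position
    that has no character on that side.
    """
    equal = sum(1 for p, t in aligned_pairs if p is not None and p == t)
    partial_len = sum(1 for p, _ in aligned_pairs if p is not None)
    template_len = sum(1 for _, t in aligned_pairs if t is not None)
    total = max(partial_len, template_len)  # the longer length
    return equal / total if total else 0.0
```

For example, a trajectory with two matching characters, a partial length of 3, and a template length of 4 has similarity 2 / 4 = 0.5.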
  6. The semantic-parsing-based text accuracy calculation method according to claim 1, wherein the step of determining, according to the first trajectory, the start point and the end point in the template text that correspond to the partially transcribed text, to obtain a first start point and a first end point, comprises:
    marking the first element and the last element of the first trajectory;
    marking, according to the first element and the last element of the first trajectory, the corresponding characters of the template text, to obtain the first start point and the first end point respectively.
  7. The semantic-parsing-based text accuracy calculation method according to claim 6, wherein the step of obtaining a new template text from the template text according to the first start point and the first end point comprises:
    extracting the characters between the first start point and the first end point, wherein the characters between the first start point and the first end point include the character corresponding to the first start point and the character corresponding to the first end point;
    generating text from the extracted characters to obtain the new template text.
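Claims 6 and 7 together amount to reading two positions off the selected trajectory and taking an inclusive slice of the template. A minimal sketch, assuming the trajectory is available as an ordered list of `(partial_index, template_index)` pairs with 0-based indices (a hypothetical representation, not one given in the claims):

```python
def new_template_from_track(template, track):
    """Return (start, end, new_template) for one trajectory.

    track is assumed to be an ordered list of
    (partial_index, template_index) pairs, 0-based.
    """
    if not track:
        raise ValueError("track must contain at least one element")
    start = track[0][1]    # template position of the first element
    end = track[-1][1]     # template position of the last element
    # Both end characters are included, hence end + 1 in the slice.
    return start, end, template[start:end + 1]
```

For the template `"the quick brown fox"` and a trajectory covering template positions 4 through 8, the new template is `"quick"`.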
  8. A semantic-parsing-based text accuracy calculation apparatus, wherein the apparatus comprises:
    a first acquisition module, configured to acquire partially transcribed text whose transcription starts from any position in the template text other than the starting point;
    an establishing module, configured to establish an edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two;
    a first calculation module, configured to calculate the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
    a generation module, configured to record the calculation trajectory of the value of each element in the edit distance matrix and generate a trajectory matrix corresponding to the edit distance matrix;
    a screening module, configured to calculate the similarity of each trajectory in the trajectory matrix and select the trajectory along which the partially transcribed text is most similar to the template text, to obtain a first trajectory;
    an obtaining module, configured to determine, according to the first trajectory, the start point and the end point in the template text that correspond to the partially transcribed text, to obtain a first start point and a first end point;
    a second acquisition module, configured to obtain a new template text from the template text according to the first start point and the first end point;
    a second calculation module, configured to compare the partially transcribed text with the new template text and calculate the accuracy of the partially transcribed text by means of an edit distance algorithm.
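The final comparison performed by the second calculation module could look roughly like the sketch below. The claim only states that accuracy is calculated with an edit distance algorithm, so the formula used here, 1 minus the distance divided by the length of the new template text and clamped at 0, is an assumption. The function names are hypothetical as well.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(prev[j - 1], prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def accuracy(partial, new_template):
    """Assumed accuracy formula: 1 - distance / len(new_template)."""
    if not new_template:
        return 1.0 if not partial else 0.0
    distance = levenshtein(partial, new_template)
    return max(0.0, 1.0 - distance / len(new_template))
```

Comparing against the extracted new template rather than the full template keeps the untranscribed remainder of the template from being counted as errors.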
  9. The semantic-parsing-based text accuracy calculation apparatus according to claim 8, wherein the apparatus comprises:
    a first input module, configured to enter the characters of the template text starting from the third element of the first row of the edit distance matrix;
    a second input module, configured to enter the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
    a definition module, configured to define the value of the second element in the second row of the edit distance matrix as 0;
    a first initialization module, configured to initialize the values of the elements of the second row of the edit distance matrix by starting from the value 0 of the second element in the second row and incrementing by 1 for each successive element;
    a second initialization module, configured to initialize the values of the elements of the second column of the edit distance matrix by starting from the value 0 of the second element in the second column and incrementing by 1 for each successive element.
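The matrix layout described by these modules, with one label row, one label column, and an initialized second row and second column, can be sketched in Python as follows. The function name `init_matrix` and the use of `None` for unused or not yet computed cells are assumptions for illustration.

```python
def init_matrix(template, partial):
    """Build the (len(partial) + 2) x (len(template) + 2) matrix.

    Row 0 holds the template characters from its third element on,
    column 0 holds the partially transcribed characters from its third
    element on, and the second row and second column count up from 0.
    Cells that are not set yet are None.
    """
    cols = len(template) + 2
    rows = len(partial) + 2
    matrix = [[None] * cols for _ in range(rows)]
    for j, ch in enumerate(template):
        matrix[0][j + 2] = ch          # first row, from the third element
    for i, ch in enumerate(partial):
        matrix[i + 2][0] = ch          # first column, from the third element
    for j in range(1, cols):
        matrix[1][j] = j - 1           # second row: 0, 1, 2, ...
    for i in range(1, rows):
        matrix[i][1] = i - 1           # second column: 0, 1, 2, ...
    return matrix
```

For a template `"abc"` and a partial text `"bc"`, the first row is `[None, None, 'a', 'b', 'c']` and the second row is `[None, 0, 1, 2, 3]`.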
  10. The semantic-parsing-based text accuracy calculation apparatus according to claim 9, wherein the first calculation module comprises:
    a first identification module, configured to identify the column number and the row number of the third element in the third column of the edit distance matrix;
    a second identification module, configured to identify the character of the template text corresponding to that column number and the character of the partially transcribed text corresponding to that row number;
    a first determination module, configured to determine whether the character of the template text corresponding to the column of the third element in the third column of the edit distance matrix is equal to the character of the partially transcribed text corresponding to the row of that element; if the two characters are equal, the value of the third element in the third column of the edit distance matrix is the value of the element at its upper left; if the two characters are not equal, the value of the third element in the third column of the edit distance matrix is obtained by adding 1 to the minimum of the values of the elements to its left, at its upper left, and above it;
    a first calculation submodule, configured to calculate in turn the value of the fourth element in the third column of the edit distance matrix, and so on, until the values of all elements in the edit distance matrix have been calculated.
  11. The semantic-parsing-based text accuracy calculation apparatus according to claim 8, wherein the generation module comprises:
    a first recording module, configured to record the calculation trajectory of the value of each element in the edit distance matrix;
    a first marking module, configured to mark, according to the calculation trajectory of the value of each element in the edit distance matrix, the origin of the value of each element in the edit distance matrix;
    a first generation module, configured to generate, after the marking is completed, the trajectory matrix corresponding to the edit distance matrix.
  12. The semantic-parsing-based text accuracy calculation apparatus according to claim 8, wherein the screening module comprises:
    a third identification module, configured to identify, for each trajectory in the trajectory matrix, the number of characters of the partially transcribed text that are equal to the corresponding characters of the template text, to obtain the number of equal characters;
    a first comparison module, configured to compare, for each trajectory in the trajectory matrix, the character length of the partially transcribed text with the character length of the corresponding template text, and select the longer of the two as the total number of characters;
    a third calculation module, configured to calculate, for each trajectory in the trajectory matrix, the ratio of the number of equal characters to the corresponding total number of characters, to obtain the similarity of each trajectory in the trajectory matrix.
  13. The semantic-parsing-based text accuracy calculation apparatus according to claim 8, wherein the obtaining module comprises:
    a second marking module, configured to mark the first element and the last element of the first trajectory;
    a first obtaining submodule, configured to mark, according to the first element and the last element of the first trajectory, the corresponding characters of the template text, to obtain the first start point and the first end point respectively.
  14. The semantic-parsing-based text accuracy calculation apparatus according to claim 13, wherein the second acquisition module comprises:
    an extraction module, configured to extract the characters between the first start point and the first end point, wherein the characters between the first start point and the first end point include the character corresponding to the first start point and the character corresponding to the first end point;
    a second acquisition submodule, configured to generate text from the extracted characters to obtain the new template text.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a semantic-parsing-based text accuracy calculation method, the method comprising:
    acquiring partially transcribed text whose transcription starts from any position in the template text other than the starting point;
    establishing an edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two;
    calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
    recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix;
    calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory along which the partially transcribed text is most similar to the template text, to obtain a first trajectory;
    determining, according to the first trajectory, the start point and the end point in the template text that correspond to the partially transcribed text, to obtain a first start point and a first end point;
    obtaining a new template text from the template text according to the first start point and the first end point;
    comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text by means of an edit distance algorithm.
  16. The computer device according to claim 15, wherein, after the step of establishing the edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two, and before the step of calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text, the method comprises:
    entering the characters of the template text starting from the third element of the first row of the edit distance matrix;
    entering the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
    defining the value of the second element in the second row of the edit distance matrix as 0;
    initializing the values of the elements of the second row of the edit distance matrix by starting from the value 0 of the second element in the second row and incrementing by 1 for each successive element;
    initializing the values of the elements of the second column of the edit distance matrix by starting from the value 0 of the second element in the second column and incrementing by 1 for each successive element.
  17. The computer device according to claim 16, wherein the value of each uninitialized element of the edit distance matrix is determined by the value of one of the elements to its left, at its upper left, or above it, and the step of calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text comprises:
    identifying the column number and the row number of the third element in the third column of the edit distance matrix;
    identifying the character of the template text corresponding to that column number and the character of the partially transcribed text corresponding to that row number;
    determining whether the character of the template text corresponding to the column of the third element in the third column of the edit distance matrix is equal to the character of the partially transcribed text corresponding to the row of that element;
    if the two characters are equal, taking the value of the element at the upper left of the third element in the third column of the edit distance matrix as the value of that element;
    if the two characters are not equal, obtaining the value of the third element in the third column of the edit distance matrix by adding 1 to the minimum of the values of the elements to its left, at its upper left, and above it;
    calculating in turn the value of the fourth element in the third column of the edit distance matrix, and so on, until the values of all elements in the edit distance matrix have been calculated.
  18. The computer device according to claim 15, wherein the step of recording the calculation trajectory of the value of each element in the edit distance matrix and generating a trajectory matrix corresponding to the edit distance matrix comprises:
    recording the calculation trajectory of the value of each element in the edit distance matrix;
    marking, according to the calculation trajectory of the value of each element in the edit distance matrix, the origin of the value of each element in the edit distance matrix;
    after the marking is completed, generating the trajectory matrix corresponding to the edit distance matrix.
  19. A non-volatile computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a semantic-parsing-based text accuracy calculation method, the method comprising:
    acquiring partially transcribed text whose transcription starts from any position in the template text other than the starting point;
    establishing an edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two;
    calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text;
    recording the calculation trajectory of the value of each element in the edit distance matrix, and generating a trajectory matrix corresponding to the edit distance matrix;
    calculating the similarity of each trajectory in the trajectory matrix, and selecting the trajectory along which the partially transcribed text is most similar to the template text, to obtain a first trajectory;
    determining, according to the first trajectory, the start point and the end point in the template text that correspond to the partially transcribed text, to obtain a first start point and a first end point;
    obtaining a new template text from the template text according to the first start point and the first end point;
    comparing the partially transcribed text with the new template text, and calculating the accuracy of the partially transcribed text by means of an edit distance algorithm.
  20. The non-volatile computer-readable storage medium according to claim 19, wherein, after the step of establishing the edit distance matrix whose number of columns is the character length of the template text plus two and whose number of rows is the character length of the partially transcribed text plus two, and before the step of calculating the value of each element in the edit distance matrix according to the partially transcribed text and the template text, the method comprises:
    entering the characters of the template text starting from the third element of the first row of the edit distance matrix;
    entering the characters of the partially transcribed text starting from the third element of the first column of the edit distance matrix;
    defining the value of the second element in the second row of the edit distance matrix as 0;
    initializing the values of the elements of the second row of the edit distance matrix by starting from the value 0 of the second element in the second row and incrementing by 1 for each successive element;
    initializing the values of the elements of the second column of the edit distance matrix by starting from the value 0 of the second element in the second column and incrementing by 1 for each successive element.
PCT/CN2018/124399 2018-11-13 2018-12-27 Text accuracy calculation method and apparatus based on semantic parsing, and computer device WO2020098099A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811347235.2A CN109657210B (en) 2018-11-13 2018-11-13 Text accuracy rate calculation method and device based on semantic analysis and computer equipment
CN201811347235.2 2018-11-13

Publications (1)

Publication Number Publication Date
WO2020098099A1 (en) 2020-05-22

Family

ID=66110906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124399 WO2020098099A1 (en) 2018-11-13 2018-12-27 Text accuracy calculation method and apparatus based on semantic parsing, and computer device

Country Status (2)

Country Link
CN (1) CN109657210B (en)
WO (1) WO2020098099A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725909A (en) * 2024-02-18 2024-03-19 四川日报网络传媒发展有限公司 Multi-dimensional comment auditing method and device, electronic equipment and storage medium
CN117725909B (en) * 2024-02-18 2024-05-14 四川日报网络传媒发展有限公司 Multi-dimensional comment auditing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8001136B1 (en) * 2007-07-10 2011-08-16 Google Inc. Longest-common-subsequence detection for common synonyms
CN103699591A (en) * 2013-12-11 2014-04-02 湖南大学 Page body extraction method based on sample page
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Text similarity measure combining word aggregation and word combination semantic features

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5544602B2 (en) * 2010-11-15 2014-07-09 株式会社日立製作所 Word semantic relationship extraction apparatus and word semantic relationship extraction method
CN102622338B (en) * 2012-02-24 2014-02-26 北京工业大学 Computer-assisted computing method of semantic distance between short texts
CN105183732A (en) * 2014-06-04 2015-12-23 广州市动景计算机科技有限公司 Method and device for processing webpage
CN111324784B (en) * 2015-03-09 2023-05-16 创新先进技术有限公司 Character string processing method and device
CN105117054B (en) * 2015-08-12 2018-04-17 珠海优特物联科技有限公司 A kind of recognition methods of handwriting input and system
CN106372061B (en) * 2016-09-12 2020-11-24 电子科技大学 Short text similarity calculation method based on semantics
CN107885718B (en) * 2016-09-30 2020-01-24 腾讯科技(深圳)有限公司 Semantic determination method and device

Also Published As

Publication number Publication date
CN109657210A (en) 2019-04-19
CN109657210B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
TWI621077B (en) Character recognition method and server for claim documents
US10650192B2 (en) Method and device for recognizing domain named entity
JP5175206B2 (en) Automatic detection and application of editing patterns in draft documents
WO2022142613A1 (en) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
JP2021512519A5 (en)
US7372993B2 (en) Gesture recognition
JP5917804B2 (en) Document editing using anchors
WO2020215696A1 (en) Method for extracting video subtitles, device, computer apparatus and storage medium
CN111680634B (en) Document file processing method, device, computer equipment and storage medium
CN109657675B (en) Image annotation method and device, computer equipment and readable storage medium
Berg-Kirkpatrick et al. Improved typesetting models for historical OCR
WO2019153587A1 (en) User identity authentication method and apparatus, computer device and storage medium
WO2021120664A1 (en) Abnormal inode dynamic repair method and system, and related component
CN110321142A (en) A kind of interface document update method, device, electronic equipment and storage medium
US20110295881A1 (en) Merging computer product, method, and apparatus
CN111325031A (en) Resume parsing method and device
WO2020098099A1 (en) Text accuracy calculation method and apparatus based on semantic parsing, and computer device
WO2020098098A1 (en) Semantic analysis-based text accuracy calculation method, device and computer device
CN111159997B (en) Intelligent verification method for enterprise bidding document
CN106095808A (en) The method and apparatus that a kind of MDB file fragmentation recovers
CN116362219A (en) Information extraction template generation method and device, medium and equipment
WO2021017281A1 (en) Data processing method and system, computer device, and storage medium
KR101449725B1 (en) Apparatus and method for converting pdf document
CN114115810A (en) Natural language requirement detection and analysis method based on description template
US8775528B2 (en) Computer readable recording medium storing linking keyword automatically extracting program, linking keyword automatically extracting method and apparatus

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18940473

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 18940473

Country of ref document: EP

Kind code of ref document: A1