WO2023001308A1 - Text recognition method and device, computer-readable storage medium and electronic device

Text recognition method and device, computer-readable storage medium and electronic device

Info

Publication number
WO2023001308A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
recognized
recognition
word vector
recognition result
Prior art date
Application number
PCT/CN2022/107580
Other languages
English (en)
French (fr)
Inventor
李发科
王为磊
屠昶旸
张济徽
Original Assignee
智慧芽信息科技(苏州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 智慧芽信息科技(苏州)有限公司 filed Critical 智慧芽信息科技(苏州)有限公司
Publication of WO2023001308A1
Priority to US18/217,766 (US20230351110A1)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/18 Legal services
    • G06Q 50/184 Intellectual property management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/817 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level by voting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V 30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V 30/268 Lexical context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F 2216/11 Patent retrieval

Definitions

  • The present application relates to the technical field of text processing, and in particular to a text recognition method and device, a computer-readable storage medium, and an electronic device.
  • A patent document mainly includes three parts: the technical problem, the technical solution, and the technical effect. Patent documents can be classified in detail according to their technical effects.
  • The embodiments of the present application provide a text recognition method and device, a computer-readable storage medium, and an electronic device, which address the problems of inaccurate and inefficient text recognition.
  • A text recognition method includes: determining a plurality of character strings corresponding to a text to be recognized based on the text to be recognized, wherein adjacent character strings among the plurality of character strings partially overlap; performing word vector conversion on the plurality of character strings to obtain a plurality of word vectors, wherein the plurality of word vectors correspond one-to-one to the plurality of character strings; generating, based on the plurality of word vectors, word vector recognition results corresponding to the plurality of word vectors, wherein each word vector recognition result is functional text or non-functional text; and determining the text recognition result of the text to be recognized based on the word vector recognition results corresponding to the plurality of word vectors.
  • In some embodiments, the number of character strings is M, where M is a positive integer greater than 1, and determining the plurality of character strings corresponding to the text to be recognized based on the text to be recognized includes: determining, based on the text to be recognized, a first character string of a preset character string length; and taking a character in the Nth character string as the starting character of the (N+1)th character string, decomposing the text to be recognized based on the preset character string length and a preset decomposition step to obtain the (N+1)th character string, where N is a positive integer greater than or equal to 1 and less than M.
  • In some embodiments, determining the text recognition result of the text to be recognized based on the word vector recognition results corresponding to the plurality of word vectors includes: performing a voting operation on the word vector recognition results corresponding to the plurality of word vectors by using a voting mechanism, so as to determine the text recognition result of the text to be recognized.
  • In some embodiments, using a voting mechanism to perform a voting operation on the word vector recognition results corresponding to the plurality of word vectors and determine the text recognition result of the text to be recognized includes: determining, based on the text to be recognized, a plurality of units to be recognized corresponding to the text to be recognized, wherein each unit to be recognized corresponds to at least one word vector; for each of the plurality of units to be recognized, performing a voting operation on the word vector recognition results corresponding to the unit to be recognized by using the voting mechanism, so as to determine the text recognition result of the unit to be recognized; and determining the text recognition result corresponding to the text to be recognized based on the text recognition results corresponding to the plurality of units to be recognized.
  • the text to be recognized is a patent text
  • The units to be recognized include at least one of a sentence, a paragraph, and a text module in the patent text, wherein the text module includes at least one of an abstract module, a claims module, and a description module.
  • In some embodiments, for each unit to be recognized among the plurality of units to be recognized, using a voting mechanism to perform a voting operation on the word vector recognition results corresponding to the unit to be recognized and determine the text recognition result of the unit to be recognized includes: if, among the word vector recognition results corresponding to the unit to be recognized, the number of functional-text results is greater than or equal to the number of non-functional-text results, determining that the text recognition result of the unit to be recognized is functional text.
  • In some embodiments, generating the word vector recognition results corresponding to the plurality of word vectors based on the plurality of word vectors includes: using an efficacy recognition model to generate the word vector recognition results corresponding to the plurality of word vectors based on the plurality of word vectors, wherein the efficacy recognition model is configured to generate, based on an input word vector, the word vector recognition result corresponding to the input word vector.
  • In some embodiments, before using the efficacy recognition model to generate the word vector recognition results corresponding to the plurality of word vectors, the method further includes: determining a training text and a text recognition result corresponding to the training text; determining, based on the training text and the corresponding text recognition result, a plurality of word vector samples corresponding to the training text and word vector recognition results corresponding to the plurality of word vector samples; and establishing an initial network model and training the initial network model based on the plurality of word vector samples and their corresponding word vector recognition results, so as to generate the efficacy recognition model.
  • the training text includes a first language training text and a second language training text
  • the first language training text includes text content written in the first language
  • the second language training text includes text content written in a second language
  • the text recognition result corresponding to the first language training text is the first text recognition result
  • the text recognition result corresponding to the second language training text is the second text recognition result
  • Determining the training text and the text recognition result corresponding to the training text includes: obtaining the first language training text and the second language training text; obtaining, based on the first language training text and the first text recognition result, a first efficacy-marked text corresponding to the first language training text; and determining, based on the first efficacy-marked text, the second text recognition result corresponding to the second language training text.
  • In some embodiments, determining the second text recognition result corresponding to the second language training text based on the first efficacy-marked text includes: translating the first efficacy-marked text to obtain a translated text corresponding to the first efficacy-marked text, wherein the translated text is expressed in the second language; and determining the second text recognition result based on the translated text corresponding to the first efficacy-marked text by using a similarity algorithm.
  • the training text is a patent text
  • Obtaining the first language training text and the second language training text includes: obtaining patent family text data including a plurality of different languages; and screening the first language training text and the second language training text based on the patent family text data, wherein the first language training text includes efficacy identification paragraph information, and the second language training text includes a patent text to be marked with efficacy.
  • the first language includes Japanese
  • the first language training text includes Japanese patent text
  • A text recognition device provided by an embodiment of the present application includes: a splitting module configured to determine a plurality of character strings corresponding to a text to be recognized based on the text to be recognized, wherein adjacent character strings among the plurality of character strings partially overlap; a conversion module configured to perform word vector conversion on the plurality of character strings to obtain a plurality of word vectors, wherein the plurality of word vectors correspond one-to-one to the plurality of character strings; a generation module configured to generate word vector recognition results corresponding to the plurality of word vectors based on the plurality of word vectors, wherein each word vector recognition result is functional text or non-functional text; and a determination module configured to determine the text recognition result of the text to be recognized based on the word vector recognition results corresponding to the plurality of word vectors.
  • An embodiment of the present application provides a computer-readable storage medium; the storage medium stores instructions, and when the instructions are executed by the processor of an electronic device, the electronic device can execute the text recognition method of any of the above embodiments.
  • An embodiment of the present application provides an electronic device, and the electronic device includes: a processor; and a memory for storing computer-executable instructions; the processor is configured to execute the computer-executable instructions so as to implement the text recognition method of any of the above embodiments.
  • In the embodiments of the present application, a plurality of character strings corresponding to the text to be recognized are determined based on the text to be recognized, and adjacent character strings among the plurality of character strings partially overlap, so that the plurality of character strings can reflect the contextual relationships within the text to be recognized.
  • On this basis, text recognition can be performed at a finer granularity; the text recognition result of the text to be recognized is determined by integrating the multiple word vector recognition results, so the contextual relationships of the text to be recognized are taken into account and the accuracy of text recognition is improved.
  • the text recognition method of the present application does not require manual indexing, which improves the text recognition efficiency.
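  • As a high-level illustration of the claimed flow (split the text into overlapping character strings, convert each string into a word vector, classify each vector as functional or non-functional text, and aggregate by voting), a minimal sketch is given below. The `vectorize` and `classify` callables stand in for the Word2vec conversion and the efficacy recognition model described later; all names and the simple slicing are assumptions made for illustration, not the application's actual implementation.

```python
from typing import Callable, List, Sequence

def recognize_text(text: str,
                   span_len: int,
                   step: int,
                   vectorize: Callable[[str], Sequence[float]],
                   classify: Callable[[Sequence[float]], bool]) -> bool:
    """Return True if the text to be recognized is judged to be functional (efficacy) text.

    1. Decompose the text into overlapping character strings (spans).
    2. Convert each span into a word vector.
    3. Classify each word vector as functional (True) or non-functional (False).
    4. Vote: functional wins when it has at least as many votes as non-functional.
    """
    spans: List[str] = [text[i:i + span_len] for i in range(0, len(text), step)] or [text]
    votes = [classify(vectorize(span)) for span in spans]
    functional = sum(votes)
    return functional >= len(votes) - functional
```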
  • FIG. 1 is a schematic diagram of an application scenario of a text recognition method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a text recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • Fig. 4a is a schematic diagram of a method for determining a character string provided by an embodiment of the present application.
  • Fig. 4b is a schematic diagram of a character string determination method provided by another embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • Fig. 6a is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • Fig. 6b is a schematic diagram of a corresponding relationship between a unit to be recognized and a character string provided by an embodiment of the present application.
  • FIG. 6c is a schematic diagram of a corresponding relationship between a unit to be recognized and a character string provided by another embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • Fig. 11a is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • Fig. 11b is a schematic diagram of a first efficacy markup text provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a character string determination unit provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a determination module provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a voting unit provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a voting operation subunit provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a generation module provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a text recognition device provided by another embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a sample preprocessing module provided by an embodiment of the present application.
  • FIG. 22 is a schematic structural diagram of a second sample preprocessing unit provided by an embodiment of the present application.
  • FIG. 23 is a schematic structural diagram of a sample preprocessing module provided by another embodiment of the present application.
  • FIG. 24 is a schematic structural diagram of a first sample preprocessing unit provided by another embodiment of the present application.
  • FIG. 25 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the technical solutions provided in this application can be applied to smart terminals (such as tablet computers, mobile phones, etc.), so that the smart terminals can have related functions, such as the recognition function of functional text, the translation function of text, and the like.
  • the technical solutions provided in this application can be applied to patent retrieval scenarios.
  • For example, the subject matter of a patent document can be judged through its technical effect, and patents can be classified accordingly to improve the efficiency of patent retrieval.
  • This application uses a deep learning network model to identify technical effect paragraphs, which narrows the scope of searching for technical effect words.
  • In addition, this application can directly present the technical effect content of a patent, which saves users the time of reading patent documents and helps users quickly understand the subject matter and technical features of the patent.
  • FIG. 1 is a schematic diagram of an application scenario of a text recognition method provided by an embodiment of the present application.
  • the scenario shown in FIG. 1 includes a server 110 and a client 120 communicatively connected to the server 110 .
  • The server 110 is configured to determine multiple word vectors based on the text to be recognized, where the multiple word vectors are used to represent the semantic and grammatical information of the text to be recognized; to generate, based on the multiple word vectors, the word vector recognition results corresponding to the multiple word vectors, where each word vector recognition result is functional text or non-functional text; and to determine the text recognition result of the text to be recognized based on the word vector recognition results corresponding to the plurality of word vectors.
  • The client 120 can receive the text to be recognized input by the user and send it to the server 110; the server 110 generates a text recognition result based on the received text to be recognized and sends the generated text recognition result to the client 120, and the client 120 presents the received text recognition result to the user.
  • FIG. 2 is a schematic flowchart of a text recognition method provided by an embodiment of the present application. As shown in FIG. 2 , the text recognition method provided by the embodiment of the present application includes the following steps.
  • Step 210 determine multiple character strings corresponding to the text to be recognized based on the text to be recognized.
  • the text to be recognized is: "The utility model relates to the technical field of agricultural machinery, and specifically relates to a structure for the overall configuration of the protective cover of the threshing part of a crawler harvester. It includes the threshing frame and the overall protective cover.”
  • The multiple character strings corresponding to the text to be recognized can be: "Span1: the utility model relates to the field of agricultural machinery technology", "Span2: the field of mechanical technology, specifically related to a crawler harvester threshing department protection", "Span3: the threshing department protective cover overall configuration structure. Including the threshing rack", "Span4: structure. …".
  • Span represents the string
  • span1 represents the first string
  • span2 represents the second string
  • Each character string may include the same number of characters or a different number of characters, which is not specifically limited in this application.
  • Adjacent character strings may include some of the same characters; for example, the first character string and the second character string both include the characters "mechanical technology field".
  • the text to be recognized may be a sentence, a paragraph, or a piece of text, which is not specifically limited in this application.
  • Step 220 performing word-vector conversion on multiple character strings to obtain multiple word vectors.
  • For example, Word2vec (word to vector) may be used: Word2vec is a model used to generate word vectors, and inputting a character string into Word2vec yields the corresponding word vector.
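  • A rough sketch of this conversion step, assuming gensim's Word2Vec with character-level tokens and simple vector averaging (none of which the application prescribes), might look like this:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens (for Chinese text this could be
# a list of characters or of segmented words; character-level tokens are assumed here).
corpus = [list("本实用新型涉及农业机械技术领域"),
          list("具体涉及履带收割机脱粒部防护罩整体构型的结构")]

model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, epochs=50)

def span_to_vector(span: str, model: Word2Vec) -> np.ndarray:
    """Average the vectors of the in-vocabulary tokens of a character string."""
    vectors = [model.wv[ch] for ch in span if ch in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

vec = span_to_vector("涉及农业机械技术领域", model)
print(vec.shape)  # (64,)
```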
  • Step 230 generating word vector recognition results corresponding to the multiple word vectors based on the multiple word vectors.
  • the word vector recognition result is functional text or non-functional text.
  • Each word vector corresponds to a word vector recognition result.
  • Functional texts are texts representing technical effects
  • non-functional texts are texts other than texts representing technical effects.
  • the text to be recognized may be the patent text
  • the functional text is the text representing the technical effect in the patent text
  • the non-functional text is other text in the patent text except the text representing the technical effect.
  • Step 240 Determine the text recognition result of the text to be recognized based on the word vector recognition results corresponding to each of the multiple word vectors.
  • If the text to be recognized is a sentence, the text recognition result of the text to be recognized may be that the sentence is functional text or non-functional text. If the text to be recognized is a paragraph, the text recognition result may be that the paragraph is functional text or non-functional text. If the text to be recognized is a piece of text, the text recognition result may be that a sentence or a paragraph in the text is marked as functional text.
  • In this way, multiple word vectors can be obtained and used to comprehensively represent the text to be recognized, providing accurate data support for text recognition.
  • In the embodiments of the present application, a plurality of character strings corresponding to the text to be recognized are determined based on the text to be recognized, and adjacent character strings among the plurality of character strings partially overlap, so that the plurality of character strings can reflect the contextual relationships within the text to be recognized.
  • On this basis, text recognition can be performed at a finer granularity; the text recognition result of the text to be recognized is determined by integrating the multiple word vector recognition results, so the contextual relationships of the text to be recognized are taken into account and the accuracy of text recognition is improved.
  • the text recognition method of the present application does not require manual indexing, which improves the text recognition efficiency.
  • FIG. 3 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • The embodiment shown in FIG. 3 of the present application is extended on the basis of the embodiment shown in FIG. 2; the differences between the embodiment shown in FIG. 3 and the embodiment shown in FIG. 2 will be emphasized below, and the similarities will not be repeated.
  • the step of determining multiple character strings corresponding to the text to be recognized based on the text to be recognized includes the following steps.
  • Step 310 determine the first character string with a preset character string length based on the text to be recognized.
  • In this embodiment, the number of character strings is M, and M is a positive integer greater than 1.
  • The preset character string length is a length set in advance, for example, 256 characters or 128 characters; it can be selected according to actual needs, which is not specifically limited in this application.
  • Step 320 using a character in the Nth character string as the starting character of the (N+1)th character string, decomposing the text to be recognized based on the preset character string length and the preset decomposition step, to obtain the (N+1)th character string.
  • N is a positive integer greater than or equal to 1 and less than M.
  • the preset decomposition step can be 128 characters or 64 characters, and the preset decomposition step can be selected according to actual needs, which is not specifically limited in this application.
  • a sliding window method may be used to determine multiple character strings.
  • For example, the text to be recognized is: "The utility model relates to the technical field of agricultural machinery, and specifically relates to a structure for the overall configuration of the protective cover of the threshing section of a crawler harvester. It includes the threshing frame and the overall protective cover."
  • The preset character string length of the sliding window may be 20 characters, and the preset decomposition step of the sliding window may be 10 characters.
  • The multiple character strings corresponding to the text to be recognized are then: "Span1: the utility model relates to agricultural machinery", "Span2: relates to the field of agricultural machinery technology", "Span3: the field of mechanical technology, specifically involves", "Span4: specifically relates to a crawler type", "Span5: a crawler harvester threshing", "Span6: the whole protective cover of the threshing part of the harvester", "Span7: the knot of the overall configuration of the protective cover", "Span8: the structure of the body configuration. Including Detach", "Span9: structure. Including the threshing rack, also", "Span10: granulation rack, also including the overall anti", "Span11: including the overall protective cover. Padding Padding". That is, in FIG. 4b, the number M of character strings is 11.
  • If N is 1, the first character string is Span1 and the (N+1)th character string is Span2, that is, a certain character in Span1 is the starting character of Span2.
  • "Padding" is an automatic placeholder. If the length of the last string is less than the preset string length, the automatic placeholder can be used to supplement the length of the string so that the length of the last string is equal to the preset string length .
  • a rectangle in Figure 4b represents a character string, and a hatched rectangle represents the overlapping portion of adjacent character strings.
  • In this way, the (N+1)th character string can be obtained so that each character string includes the same number of characters and adjacent character strings include the same number of overlapping characters, which improves the uniformity of character string decomposition and better represents the semantic and grammatical relationships between character strings, thus providing more accurate data support for text recognition.
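  • A minimal sliding-window decomposition consistent with this description (fixed window length, fixed decomposition step, overlap between adjacent strings, and padding of the last string) is sketched below; the padding character and the exact boundary handling are assumptions, since the application leaves them open:

```python
from typing import List

def split_into_spans(text: str, span_len: int = 20, step: int = 10, pad: str = " ") -> List[str]:
    """Decompose text into overlapping character strings of equal length.

    Each span starts `step` characters after the previous one, so adjacent spans
    share span_len - step characters; the last span is padded up to span_len.
    """
    spans = []
    start = 0
    while start < len(text):
        span = text[start:start + span_len]
        if len(span) < span_len:                  # last window: pad to the preset length
            span = span + pad * (span_len - len(span))
        spans.append(span)
        if start + span_len >= len(text):         # the window already reaches the end
            break
        start += step
    return spans

text = "本实用新型涉及农业机械技术领域，具体涉及履带收割机脱粒部防护罩整体构型的结构。包括脱粒机架，还包括整体防护罩。"
for i, s in enumerate(split_into_spans(text), 1):
    print(f"Span{i}: {s}")
```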
  • FIG. 5 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • the embodiment shown in FIG. 5 of the present application is extended. The difference between the embodiment shown in FIG. 5 and the embodiment shown in FIG. 2 will be emphasized below, and the similarities will not be repeated.
  • the step of determining the text recognition result of the text to be recognized based on the word vector recognition results corresponding to each of the multiple word vectors includes the following steps.
  • Step 510 using a voting mechanism to perform a voting operation on the word vector recognition results corresponding to each of the multiple word vectors, and determine the text recognition result of the text to be recognized.
  • If, among the word vector recognition results corresponding to the text to be recognized, the number of functional texts is greater than or equal to the number of non-functional texts, the text recognition result of the text to be recognized is determined to be functional text. If the number of functional texts is less than the number of non-functional texts, the text recognition result of the text to be recognized is determined to be non-functional text.
  • In this way, by voting on the word vector recognition results corresponding to the multiple word vectors to determine the text recognition result of the text to be recognized, the contextual relationships of the text to be recognized are taken into account, which improves the accuracy of text recognition; moreover, the voting mechanism follows the principle that the minority obeys the majority, which further improves the accuracy of text recognition.
  • Fig. 6a is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • The embodiment shown in FIG. 6a of the present application is extended on the basis of the embodiment shown in FIG. 5; the differences between the embodiment shown in FIG. 6a and the embodiment shown in FIG. 5 will be emphasized below, and the similarities will not be repeated.
  • the step of using a voting mechanism to perform a voting operation on word vector recognition results corresponding to multiple word vectors to determine the text recognition result of the text to be recognized includes the following steps.
  • Step 610 Determine multiple units to be recognized corresponding to the text to be recognized based on the text to be recognized.
  • the unit to be identified corresponds to at least one word vector.
  • the unit to be recognized may be a sentence or a paragraph in the text.
  • a character string may be a text fragment obtained by decomposing the sentence or paragraph.
  • the text to be recognized is a patent text
  • the unit to be recognized may be at least one of a sentence, a paragraph, and a text module in the patent text.
  • the text module can be at least one of an abstract module, a claim module and a description module.
  • Step 620 for each unit to be recognized among the plurality of units to be recognized, adopt a voting mechanism to perform a voting operation on the word vector recognition result corresponding to the unit to be recognized, and determine the text recognition result of the unit to be recognized.
  • the text recognition result of the unit to be recognized may be functional text or non-functional text.
  • T is a unit to be identified.
  • The correspondence between a unit to be recognized and a character string can be one of the following three cases. In the first case, as shown in FIG. 6b, a character string contains a unit to be recognized. In the second case, a character string has no correspondence with a unit to be recognized; as shown in FIG. 6c, the relationship between Span1, Span7 and T belongs to the second case. In the third case, a character string partially overlaps with a unit to be recognized, or a unit to be recognized contains at least one character string; as shown in FIG. 6c, the relationship between Span2 to Span6 and T belongs to the third case.
  • In the first case, the text recognition result of the unit to be recognized depends on the word vector recognition result corresponding to the character string; that is, if the word vector recognition result corresponding to the character string is functional text, the text recognition result of the unit to be recognized is functional text, and if the word vector recognition result corresponding to the character string is non-functional text, the text recognition result of the unit to be recognized is non-functional text.
  • In the third case, the text recognition result of the unit to be recognized depends on the word vector recognition results corresponding to the character strings that partially overlap with the unit to be recognized and the character strings contained in the unit to be recognized.
  • the text recognition result of the unit T to be recognized depends on the word vector recognition results of Span2 to Span6.
  • a voting mechanism needs to be adopted to perform a voting operation on the word vector recognition results corresponding to the units to be recognized to determine the text recognition results of the units to be recognized.
  • Step 630 based on the text recognition results corresponding to each of the plurality of units to be recognized, determine the text recognition result corresponding to the text to be recognized.
  • the text recognition result corresponding to the text to be recognized may be that some sentences or paragraphs are marked as functional texts.
  • By determining the text recognition result of each unit to be recognized and then determining the text recognition result corresponding to the text to be recognized based on the text recognition results corresponding to the plurality of units to be recognized, it is possible to recognize whether multiple parts of a text are functional text; the parts belonging to functional text can then be marked, which is convenient for the user to view.
  • FIG. 7 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • The embodiment shown in FIG. 7 of the present application is extended on the basis of the embodiment shown in FIG. 6a. The differences between the embodiment shown in FIG. 7 and the embodiment shown in FIG. 6a will be emphasized below, and the similarities will not be repeated.
  • For each unit to be recognized among the plurality of units to be recognized, the step of using a voting mechanism to perform a voting operation on the word vector recognition results corresponding to the unit to be recognized and determining the text recognition result of the unit to be recognized includes the following steps.
  • Step 710 for each unit to be recognized among the plurality of units to be recognized, if, among the word vector recognition results corresponding to the unit to be recognized, the number of functional texts is greater than or equal to the number of non-functional texts, determine that the text recognition result of the unit to be recognized is functional text.
  • Step 720 for each unit to be recognized among the plurality of units to be recognized, if, among the word vector recognition results corresponding to the unit to be recognized, the number of functional texts is less than the number of non-functional texts, determine that the text recognition result of the unit to be recognized is non-functional text.
  • Taking FIG. 6c as an example, the text recognition result of the unit to be recognized is determined by comparing the number of functional texts with the number of non-functional texts among the word vector recognition results of Span2 to Span6.
  • Suppose the word vector recognition results of Span2, Span3, and Span4 are functional text, and the word vector recognition results of Span5 and Span6 are non-functional text; that is, there are 3 functional texts and 2 non-functional texts, so the number of functional texts is greater than the number of non-functional texts, and the text recognition result of the unit to be recognized is therefore functional text.
  • a majority voting principle is adopted to determine the text recognition result of the unit to be recognized, which further improves the accuracy of text recognition.
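  • The majority-vote rule described above (functional text wins ties) can be sketched as follows; the list of labels stands for the word vector recognition results of the spans that overlap a given unit to be recognized, as in FIG. 6c:

```python
from typing import List

def vote_unit(span_labels: List[bool]) -> bool:
    """Majority vote over the word vector recognition results of one unit to be recognized.

    True = functional text, False = non-functional text; functional text wins when tied.
    """
    functional = sum(span_labels)
    non_functional = len(span_labels) - functional
    return functional >= non_functional

# Example mirroring FIG. 6c: unit T overlaps Span2..Span6, giving 3 functional and 2 non-functional results.
labels_for_T = [True, True, True, False, False]
print(vote_unit(labels_for_T))   # True -> the text recognition result of unit T is functional text
```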
  • FIG. 8 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • the embodiment shown in FIG. 8 of the present application is extended. The difference between the embodiment shown in FIG. 8 and the embodiment shown in FIG. 2 will be emphasized below, and the similarities will not be repeated.
  • the step of generating word vector recognition results corresponding to the multiple word vectors based on the multiple word vectors includes the following steps.
  • Step 810 using the efficacy recognition model to generate word vector recognition results corresponding to the multiple word vectors based on the multiple word vectors.
  • The efficacy recognition model is used to generate a word vector recognition result corresponding to an input word vector based on the input word vector.
  • FIG. 9 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • The embodiment shown in FIG. 9 of the present application is extended on the basis of the embodiment shown in FIG. 8. The differences between the embodiment shown in FIG. 9 and the embodiment shown in FIG. 8 will be emphasized below, and the similarities will not be repeated.
  • the following steps are further included.
  • Step 910 determine the training text and the text recognition result corresponding to the training text.
  • the training text mentioned in step 910 corresponds to the text to be recognized in the above-mentioned embodiment.
  • Step 920 based on the training text and the text recognition results corresponding to the training text, determine a plurality of word vector samples corresponding to the training text and word vector recognition results corresponding to each of the plurality of word vector samples.
  • Determining the word vector recognition results corresponding to the text recognition result may include marking the word vector recognition results of the word vectors corresponding to the functional text in the text recognition result as functional text, and marking the word vector recognition results of the word vectors corresponding to the non-functional text in the text recognition result as non-functional text.
  • Step 930 an initial network model is established, and the initial network model is trained based on the multiple word vector samples and the word vector recognition results corresponding to the multiple word vector samples, so as to generate an efficacy recognition model.
  • the efficacy recognition model mentioned in step 930 is used to generate a word vector recognition result corresponding to the input word vector based on the input word vector.
  • the initial network model can be a BERT model.
  • BERT is an open source pre-trained language model.
  • the pre-trained language model is trained on a wide range of data sets. Therefore, the BERT model is a model with a certain basic prior knowledge of the language.
  • the BERT model is used as the initial network model for training.
  • During training, the BERT model continuously adjusts the parameters in the model framework; through continuous iterative adjustment, the BERT model achieves an optimal effect, which improves the learning efficiency of the initial network model and the accuracy of the efficacy recognition model.
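  • The application names BERT as a possible initial network model but gives no training details. The sketch below fine-tunes a pretrained BERT checkpoint for the two-class task (functional vs. non-functional span) with the Hugging Face Transformers library; the checkpoint name, hyperparameters, and feeding raw span text through the tokenizer (rather than externally computed word vectors) are all assumptions:

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# Toy training pairs: (span text, label), label 1 = functional text, 0 = non-functional text.
train_pairs = [("本实用新型提高了脱粒效率", 1), ("包括脱粒机架，还包括整体防护罩", 0)]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        out = model(**batch)          # out.loss is the cross-entropy over the two classes
        out.loss.backward()
        optimizer.step()
```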
  • FIG. 10 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • the embodiment shown in FIG. 10 of the present application is extended. The difference between the embodiment shown in FIG. 10 and the embodiment shown in FIG. 9 will be emphasized below, and the similarities will not be repeated.
  • the step of determining the training text and the text recognition result corresponding to the training text includes the following steps.
  • Step 1010 acquiring the first language training text and the second language training text.
  • Step 1020 based on the first language training text and the first text recognition result, obtain the first efficacy marked text corresponding to the first language training text.
  • The first efficacy-marked text is an efficacy-marked text, that is, text in which the portion describing the efficacy has been marked.
  • In this embodiment, the training text includes a first language training text and a second language training text, and the first language training text and the second language training text include equivalent content written in different languages.
  • The first language training text includes text content written in the first language, and the second language training text includes text content written in the second language.
  • The text recognition result corresponding to the first language training text is the first text recognition result, and the text recognition result corresponding to the second language training text is the second text recognition result.
  • the text recognition result includes a first text recognition result and a second text recognition result.
  • the training text is a patent text
  • the first language training text and the second language training text belong to the same patent family.
  • the first language includes Japanese
  • the first language training text includes Japanese patent text.
  • Japanese patent texts are usually marked with an "effect of the invention" ("発明の効果") section.
  • Therefore, Japanese patent texts can be screened out first, and then the Japanese patent texts marked with "発明の効果" can be screened out.
  • The first efficacy-marked text may be the text of the effect portion marked with "発明の効果".
  • Step 1030 Determine a second text recognition result corresponding to the second language training text based on the first efficacy tagged text.
  • In this way, the second text recognition result corresponding to the second language training text can be determined quickly, which improves the efficiency of determining the training text and the text recognition result corresponding to the training text.
  • Fig. 11a is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • the embodiment shown in FIG. 11a of the present application is extended. The difference between the embodiment shown in FIG. 11a and the embodiment shown in FIG. 10 will be emphasized below, and the similarities will not be repeated.
  • the step of determining the second text recognition result corresponding to the second language training text based on the first efficacy tagged text includes the following steps.
  • Step 1110 translate the first function marked text to obtain the translated text corresponding to the first function marked text.
  • the translated text is expressed in a second language.
  • the first language can be Japanese.
  • The first efficacy-marked text in the first language may be the text of the effect portion marked with "発明の効果".
  • the second language can be Chinese, English, or other languages, which are not specifically limited in this application.
  • Step 1120 using a similarity algorithm to determine a second text recognition result based on the translated text corresponding to the first efficacy tagged text.
  • In this way, the sentence or paragraph in the second language training text that has the highest similarity to the first efficacy-marked text expressed in the second language (i.e., the translated text) can be obtained, and that sentence or paragraph is taken as the second text recognition result of the second language training text, thereby quickly determining the second text recognition result and improving the efficiency of determining the training text and the text recognition result corresponding to the training text.
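  • The application does not name a specific similarity algorithm. The sketch below uses character-level TF-IDF cosine similarity to pick, from the second language training text, the sentence most similar to the translated first efficacy-marked text; the sentence splitting and the choice of TF-IDF are assumptions:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_sentence(translated_efficacy_text: str, second_language_text: str) -> str:
    """Return the sentence of the second-language text most similar to the translated efficacy text."""
    sentences = [s for s in re.split(r"[。！？.!?]\s*", second_language_text) if s]
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform([translated_efficacy_text] + sentences)
    sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    return sentences[int(sims.argmax())]

second_text = "本实用新型涉及农业机械技术领域。本实用新型提高了脱粒部的防护效果。包括脱粒机架。"
print(most_similar_sentence("提高了脱粒部防护罩的防护效果", second_text))
```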
  • In some embodiments, the step of translating the first efficacy-marked text to obtain the translated text corresponding to the first efficacy-marked text includes the following step.
  • A text translation model is used to translate the first efficacy-marked text in the first language into the first efficacy-marked text in the second language, which further improves the efficiency of text recognition.
  • FIG. 12 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • the step of acquiring the first language training text and the second language training text includes the following steps.
  • Step 1210 acquire patent family text data including multiple different languages.
  • the patent family text data may be simple family text data.
  • Step 1220 screening the first language training text and the second language training text based on the patent family text data.
  • the first language training text includes efficacy identification paragraph information.
  • the second language training text includes the patent text to be marked with efficacy.
  • the training text can be obtained through the following steps:
  • each group of family patents includes Japanese patents and Chinese patents.
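  • As a rough illustration of how such family data might be screened, the sketch below keeps a family only when its Japanese member contains an efficacy-identifying heading (assumed here to be "発明の効果") and it also has a Chinese member to be marked; the record structure and field names are hypothetical:

```python
from typing import Dict, List, Tuple

def screen_training_pairs(families: List[Dict[str, str]],
                          marker: str = "発明の効果") -> List[Tuple[str, str]]:
    """Return (first_language_text, second_language_text) pairs from family records.

    Each record is assumed to map a language code ("ja", "zh") to the full patent text.
    """
    pairs = []
    for family in families:
        ja_text = family.get("ja", "")
        zh_text = family.get("zh", "")
        if ja_text and zh_text and marker in ja_text:
            pairs.append((ja_text, zh_text))   # ja: contains efficacy paragraph info; zh: to be marked
    return pairs
```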
  • FIG. 13 is a schematic flowchart of a text recognition method provided by another embodiment of the present application.
  • The embodiment shown in FIG. 13 of the present application is extended on the basis of the embodiment shown in FIG. 10. The differences between the embodiment shown in FIG. 13 and the embodiment shown in FIG. 10 will be emphasized below, and the similarities will not be repeated.
  • the following steps are further included.
  • Step 1310 determine the second efficacy marked text corresponding to the second language training text.
  • the second language training text may also include the second efficacy marking text, for example, some Chinese patents may also include the text of the effect part marked with "invention effect".
  • the second effect mark text may be the text of the effect part marked with "invention effect", that is, the effect mark text.
  • Step 1320 Determine the accuracy of the second text recognition result based on the second text recognition result and the second efficacy tagged text.
  • The accuracy may be the precision A or the recall rate R.
  • W1 is the number of functional paragraphs or functional sentences in the second text recognition results of the second language training texts among all the training texts, and W2 is the number of functional paragraphs or functional sentences in the second efficacy-marked texts of the second language training texts among all the training texts.
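  • The application does not spell out how the precision A and the recall rate R are computed from W1 and W2. A conventional computation, assuming the functional sentences predicted in the second text recognition results are compared against the functional sentences in the second efficacy-marked texts, could look like this sketch:

```python
from typing import Iterable, Set, Tuple

def precision_recall(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Standard precision/recall over sets of functional sentences (or paragraphs).

    predicted: functional sentences in the second text recognition results (count corresponds to W1)
    reference: functional sentences in the second efficacy-marked texts (count corresponds to W2)
    """
    pred: Set[str] = set(predicted)
    ref: Set[str] = set(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```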
  • the present application obtains through a comparative test that the precision and the recall rate obtained by the method of the present application are both greater than 90%.
  • the method of this application only needs hundreds of thousands of training texts to achieve satisfactory results.
  • Step 1330 adjust the training text according to the accuracy of the second text recognition result.
  • the training text can be adjusted, for example, the training text can be reselected.
  • In this way, the quality of the training text is improved, thereby improving the quality of the efficacy recognition model and further improving the accuracy of text recognition.
  • FIG. 14 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
  • the text recognition device 1400 provided in the embodiment of the present application includes:
  • the splitting module 1410 is configured to determine a plurality of character strings corresponding to the text to be recognized based on the text to be recognized, wherein adjacent character strings among the plurality of character strings partially overlap;
  • the conversion module 1420 is configured to convert a plurality of character strings into word vectors to obtain a plurality of word vectors, wherein the plurality of word vectors are in a one-to-one correspondence with the plurality of character strings;
  • the generation module 1430 is configured to generate word vector recognition results corresponding to the multiple word vectors based on the multiple word vectors, wherein the word vector recognition results are functional text or non-functional text;
  • the determining module 1440 is configured to determine the text recognition result of the text to be recognized based on the word vector recognition results corresponding to each of the multiple word vectors.
  • FIG. 15 is a schematic structural diagram of a character string determination unit provided by an embodiment of the present application.
  • the embodiment shown in FIG. 15 of the present application is extended on the basis of the embodiment shown in FIG. 14 of the present application.
  • the splitting module 1410 includes:
  • the first character string determination unit 1511 is configured to determine the first character string of a preset character string length based on the text to be recognized;
  • the Nth character string determining unit 1512 is configured to use a character in the Nth character string as the starting character of the (N+1)th character string, decompose the text to be recognized based on the preset character string length and the preset decomposition step, and obtain the (N+1)th character string, where N is a positive integer greater than or equal to 1 and less than M.
  • FIG. 16 is a schematic structural diagram of a determination module provided by an embodiment of the present application.
  • the embodiment shown in FIG. 16 of the present application is extended on the basis of the embodiment shown in FIG. 14 of the present application.
  • the difference between the embodiment shown in FIG. 16 and the embodiment shown in FIG. 14 will be described below, and the similarities will not be repeated.
  • the determining module 1440 includes:
  • the voting determination unit 1441 is configured to use a voting mechanism to perform a voting operation on the word vector recognition results corresponding to the multiple word vectors, and determine the text recognition result of the text to be recognized.
  • FIG. 17 is a schematic structural diagram of a voting unit provided by an embodiment of the present application.
  • the embodiment shown in FIG. 17 of the present application is extended on the basis of the embodiment shown in FIG. 16 of the present application.
  • the differences between the embodiment shown in FIG. 17 and the embodiment shown in FIG. 16 will be emphasized below, and the similarities will not be repeated.
  • the voting determination unit 1441 includes:
  • the pre-voting processing subunit 1711 is configured to determine a plurality of units to be recognized corresponding to the text to be recognized based on the text to be recognized, wherein the unit to be recognized corresponds to at least one word vector;
  • the voting subunit 1712 is configured to use a voting mechanism for each unit to be recognized among the plurality of units to be recognized, and perform a voting operation on the word vector recognition result corresponding to the unit to be recognized, and determine the text recognition result of the unit to be recognized;
  • the post-voting processing subunit 1713 is configured to determine the text recognition result corresponding to the text to be recognized based on the text recognition results corresponding to the plurality of units to be recognized.
  • FIG. 18 is a schematic structural diagram of a voting operation subunit provided by an embodiment of the present application.
  • the embodiment shown in FIG. 18 of the present application is extended on the basis of the embodiment shown in FIG. 17 of the present application.
  • the differences between the embodiment shown in FIG. 18 and the embodiment shown in FIG. 17 will be emphasized below, and the similarities will not be repeated.
  • the voting subunit 1712 includes:
  • the voting operation subunit 1811 is configured to, for each unit to be recognized among the plurality of units to be recognized, determine that the text recognition result of the unit to be recognized is functional text if, among the word vector recognition results corresponding to the unit to be recognized, the number of functional texts is greater than or equal to the number of non-functional texts.
  • FIG. 19 is a schematic structural diagram of a generation module provided by an embodiment of the present application.
  • the embodiment shown in FIG. 19 of the present application is extended on the basis of the embodiment shown in FIG. 14 of the present application.
  • the difference between the embodiment shown in FIG. 19 and the embodiment shown in FIG. 14 will be described below, and the similarities will not be repeated.
  • the generation module 1430 includes:
  • the generation unit 1431 is configured to use the efficacy recognition model to generate, based on the multiple word vectors, the word vector recognition results corresponding to the multiple word vectors, wherein the efficacy recognition model is used to generate, based on an input word vector, the word vector recognition result corresponding to that input word vector.
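For illustration only, applying a trained efficacy recognition model to the span word vectors could be sketched as below. The model object is assumed to be any binary classifier exposing a predict() method (for example, the one produced by the training sketch further below); it stands in for the fine-tuned network described in this application.

```python
import numpy as np

def classify_spans(efficacy_model, span_vectors):
    """Label each span's word vector as functional or non-functional text.

    efficacy_model: any trained binary classifier exposing predict(); a
    stand-in for the efficacy recognition model of this application.
    span_vectors: array-like of shape (n_spans, dim), one vector per span.
    """
    X = np.asarray(span_vectors)
    predictions = efficacy_model.predict(X)
    return ["functional" if p == 1 else "non-functional" for p in predictions]
```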
  • FIG. 20 is a schematic structural diagram of a text recognition device provided by another embodiment of the present application.
  • the embodiment shown in FIG. 20 of the present application is extended on the basis of the embodiment shown in FIG. 19 of the present application.
  • the difference between the embodiment shown in FIG. 20 and the embodiment shown in FIG. 19 will be described in the following, and the similarities will not be repeated.
  • the text recognition device 1400 provided in the embodiment of the present application further includes:
  • the sample preprocessing module 1450 is configured to determine the training text and the text recognition result corresponding to the training text;
  • the sample determination module 1460 is configured to determine a plurality of word vector samples corresponding to the training text and word vector recognition results corresponding to each of the plurality of word vector samples based on the training text and the text recognition results corresponding to the training text;
  • the model determination module 1470 is configured to establish an initial network model, and train the initial network model based on multiple word vector samples and word vector recognition results corresponding to the multiple word vector samples, so as to generate an efficacy recognition model.
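The description proposes training an initial network model (BERT is named as one possible choice) on word vector samples and their labels; as a hedged stand-in, the sketch below fits a simple logistic-regression classifier instead, purely to illustrate the training interface. The sample layout and the 1/0 label encoding (functional/non-functional) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_efficacy_model(span_vector_samples, span_labels):
    """Train a binary efficacy classifier from word-vector samples.

    span_vector_samples: array-like of shape (n_samples, dim), one vector per span.
    span_labels: 1 for functional text, 0 for non-functional text.
    """
    X = np.asarray(span_vector_samples)
    y = np.asarray(span_labels)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model
```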
  • FIG. 21 is a schematic structural diagram of a sample preprocessing module provided by an embodiment of the present application.
  • the embodiment shown in FIG. 21 of the present application is extended on the basis of the embodiment shown in FIG. 20 of the present application.
  • the sample preprocessing module 1450 includes:
  • the first sample preprocessing unit 1451 is configured to acquire the first language training text and the second language training text;
  • the second sample preprocessing unit 1452 is configured to obtain, based on the first language training text and the first text recognition result, the first efficacy marked text corresponding to the first language training text;
  • the third sample preprocessing unit 1453 is configured to determine, based on the first efficacy marked text, the second text recognition result corresponding to the second language training text.
  • FIG. 22 is a schematic structural diagram of a second sample preprocessing unit provided by an embodiment of the present application.
  • the embodiment shown in FIG. 22 of this application is extended on the basis of the embodiment shown in FIG. 21 of this application.
  • the differences between the embodiment shown in FIG. 22 and the embodiment shown in FIG. 21 will be described below, and the similarities will not be repeated.
  • the second sample preprocessing unit 1452 includes:
  • the translation subunit 2210 is configured to translate the first efficacy marked text to obtain the translated text corresponding to the first efficacy marked text, wherein the translated text is expressed in the second language;
  • the similarity determining subunit 2220 is configured to use a similarity algorithm to determine the second text recognition result based on the translated text corresponding to the first efficacy marked text.
  • the translation subunit 2210 is further configured to input the first efficacy marked text into a text translation model, so as to generate the translated text corresponding to the first efficacy marked text.
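A sketch of the cross-lingual labelling performed by these subunits is given below, under the assumptions that the first efficacy marked text has already been translated into the second language and that the second-language training text has been split into candidate sentences. The description only states that "a similarity algorithm" is used; TF-IDF character n-grams with cosine similarity and the 0.5 threshold are choices assumed for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_second_language_text(translated_marked_text, second_language_sentences,
                               threshold=0.5):
    """Mark as efficacy text the second-language sentences most similar to the
    translated first-language efficacy marked text.

    translated_marked_text: the first efficacy marked text, already translated
        into the second language (e.g. by a text translation model).
    second_language_sentences: candidate sentences from the second-language text.
    Returns the indices of sentences whose similarity exceeds the threshold.
    """
    # character n-grams keep the sketch language-agnostic
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    matrix = vectorizer.fit_transform([translated_marked_text] + list(second_language_sentences))
    similarities = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    return [i for i, score in enumerate(similarities) if score >= threshold]
```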
  • FIG. 23 is a schematic structural diagram of a sample preprocessing module provided by another embodiment of the present application.
  • the embodiment shown in FIG. 23 of the present application is extended on the basis of the embodiment shown in FIG. 21 of the present application.
  • the differences between the embodiment shown in FIG. 23 and the embodiment shown in FIG. 21 will be focused on below, and the similarities will not be repeated.
  • the sample preprocessing module 1450 also includes:
  • the fourth sample preprocessing unit 1454 is configured to determine the second efficacy marked text corresponding to the second language training text;
  • the comparison unit 1455 is configured to determine the accuracy of the second text recognition result based on the second text recognition result and the second efficacy marked text;
  • the adjustment unit 1456 is configured to adjust the training text according to the accuracy of the second text recognition result.
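The exact precision and recall formulas appear as formula images in the original publication, so the sketch below uses an assumed matching rule (a predicted efficacy sentence counts as correct when it also appears in the manually marked text) purely to illustrate how the comparison and adjustment units could evaluate and react to accuracy.

```python
def precision_recall(predicted_sentences, marked_sentences):
    """Rough precision/recall of predicted efficacy sentences against marked ones.

    A predicted sentence counts as correct when it also appears in the manually
    marked efficacy text; this matching rule is an assumption for illustration.
    """
    predicted = set(predicted_sentences)
    marked = set(marked_sentences)
    correct = len(predicted & marked)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(marked) if marked else 0.0
    return precision, recall
```

If, for example, both values fall below an acceptable level, the training text can be reselected, as described for the adjustment unit above.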
  • FIG. 24 is a schematic structural diagram of a first sample preprocessing unit provided by another embodiment of the present application.
  • the embodiment shown in FIG. 24 of this application is extended on the basis of the embodiment shown in FIG. 21 of this application.
  • the differences between the embodiment shown in FIG. 24 and the embodiment shown in FIG. 21 will be described below, and the similarities will not be repeated.
  • the first sample preprocessing unit 1451 further includes:
  • the sample acquisition subunit 2410 is configured to acquire patent family text data in multiple different languages.
  • the screening subunit 2420 is configured to screen the first language training text and the second language training text based on the patent family text data.
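A rough sketch of how the sample acquisition and screening subunits could select training pairs from patent family data is shown below. The record layout (one dictionary per family, keyed by language code) is an assumption, while the "発明の効果" marker follows the example given in the description.

```python
def select_training_pairs(family_records, first_lang="ja", second_lang="zh",
                          efficacy_marker="発明の効果"):
    """Pick, from each patent family, a first-language text that already carries
    an efficacy section marker and a second-language text still to be marked.

    family_records: iterable of dicts such as {"ja": japanese_text, "zh": chinese_text};
    this record layout and the marker string are assumptions for illustration.
    """
    pairs = []
    for record in family_records:
        first = record.get(first_lang)
        second = record.get(second_lang)
        if first and second and efficacy_marker in first:
            pairs.append((first, second))
    return pairs
```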
  • FIG. 25 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 250 includes: one or more processors 2501 and a memory 2502; and computer program instructions stored in the memory 2502 which, when executed by the processor 2501, cause the processor 2501 to perform the text recognition method of any one of the above embodiments.
  • the processor 2501 may be a central processing unit (CPU) or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
  • Memory 2502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include random access memory (RAM) and/or cache memory (cache), etc., for example.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.
  • One or more computer program instructions can be stored on the computer-readable storage medium, and the processor 2501 can execute the program instructions to implement the steps of the text recognition methods of the various embodiments of the present application described above and/or other desired functions.
  • the electronic device 250 may further include: an input device 2503 and an output device 2504, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown in FIG. 25 ).
  • the input device 2503 may include, for example, a keyboard, a mouse, a microphone, and the like.
  • the output device 2504 can output various information to the outside, and can include, for example, a display, a speaker, a printer, a communication network and a remote output device connected thereto, and the like.
  • the electronic device 250 may also include any other suitable components.
  • the embodiments of the present application may also be a computer program product, including computer program instructions which, when executed by a processor, cause the processor to execute the steps in the text recognition method of any of the above-mentioned embodiments.
  • the computer program product may have program code for executing the operations of the embodiments of the present application written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • embodiments of the present application may also be a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions, when executed by a processor, cause the processor to execute the steps of the text recognition methods according to the various embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
  • the computer readable storage medium may utilize any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • in the apparatus, device, and method of the present application, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations should be regarded as equivalent solutions of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Medical Informatics (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Economics (AREA)
  • Biomedical Technology (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Character Discrimination (AREA)

Abstract

This application relates to the technical field of text recognition, and in particular to a text recognition method and a text recognition apparatus, as well as a computer-readable storage medium and an electronic device, which solve the problems of inaccurate and inefficient text recognition. In the text recognition method, adjacent character strings among a plurality of character strings partially overlap, so that the plurality of character strings can reflect the contextual relationships within the text to be recognized. The plurality of character strings are then converted into word vectors to obtain a plurality of word vectors; word vector recognition results corresponding to the respective word vectors are generated based on the plurality of word vectors, so as to determine whether the text corresponding to each word vector is functional text or non-functional text; and the text recognition result of the text to be recognized is determined by combining the plurality of word vector recognition results. The contextual relationships within the text to be recognized are thereby captured, which improves the accuracy of text recognition. In addition, the text recognition method of this application does not require manual indexing, which improves the efficiency of text recognition.

Description

文本识别方法及装置、计算机可读存储介质和电子设备 技术领域
本申请涉及文本处理技术领域,具体涉及一种文本识别方法及装置、计算机可读存储介质和电子设备。
发明背景
专利文献主要包括技术问题、技术方案和技术功效三个部分。根据专利文献的技术功效部分可以对专利文献进行详细的分类。
目前,有利用专利的著录项目信息来识别功效文本,从而对专利文件进行分类的方法。但是专利的著录项目信息过于宽泛,不能准确的识别出功效文本,无法对专利文献进行详细的分类。相关技术中,为了提高识别功效文本的准确度,主要通过规则标引或人工标引的方式来确定功效文本。规则标引是通过识别具体的语法模式来确定功效文本,容易遗漏语法模式无法覆盖的其他表达,从而可能遗漏重要的专利信息,导致文本识别的准确性较低。人工标引虽然准确性较高,但是需要耗费大量的人力,导致文本识别的效率较低。
发明内容
有鉴于此,本申请实施例提供了一种文本识别方法及装置、计算机可读存储介质和电子设备,解决了文本识别不准确和效率低的问题。
第一方面,本申请一实施例提供的一种文本识别方法,包括:基于待识别文本确定所述待识别文本对应的多个字符串,其中,所述多个字符串中相邻的所述字符串有部分重叠;对所述多个字符串进行词向量转化,得到多个词向量,其中,所述多个词向量与所述多个字符串呈一一对应关系;基于多个词向量生成多个词向量各自对应的词向量识别结果,其中,词向量识别结果为功效文本或非功效文本;以及基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果。
结合本申请的第一方面,在一些实施例中,多个字符串的数量为M,M为大于1的正整数,基于待识别文本确定待识别文本对应的多个字符串,包括:基于待识别文本确定预设字符串长度的第1个字符串;以第N个字符串中的字符为第N+1个字符串的起点字符,基于预设字符串长度和预设分解步长分解待识别文本,得到第N+1个字符串,其中,N为大于或等于1,且小于M的正整数。
结合本申请的第一方面,在一些实施例中,基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果,包括:采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果。
结合本申请的第一方面,在一些实施例中,采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果,包括:基于待识别文本确定待识别文本对应的多个待识别单元,其中,待识别单元对应至少一个词向量;针对多个待识别单元中的每个待识别单元,采用投票机制,对待识别单元对应的词向量识别结果进行投票操作,确定待识别单元的文本识别结果;基于多个待识别单元各自对应的文本识别结果,确定待识别文本对应的文本识别结果。
结合本申请的第一方面,在一些实施例中,待识别文本为专利文本,待识别单元包括专利文本中的句子、段落和文本模块中的至少一种;其中,文本模块包括摘要模块、权利要求书模块和说明书模块中的至少一种。
结合本申请的第一方面,在一些实施例中,针对多个待识别单元中的每个待识别单元,采用投票机制,对待识别单元对应的词向量识别结果进行投票操作,确定待识别单元的文本识别结果,包括:针对多个待识别单元中的每个待识别单元,如果在待识别单元对应的词向量识别结果中,功效文本的数量大于或等于非功效文本的数量,确定待识别单元的文本识别结果为功效文本。
结合本申请的第一方面,在一些实施例中,基于多个词向量生成多个词向量各自对应的词向量识别结果,包括:利用功效识别模型,基于多个词向量生成多个词向量各自对应的词向量识别结果,其中,功效识别模型用于基于输入的词向量生成输入的词向量对应的词向量识别结果。
结合本申请的第一方面,在一些实施例中,在利用功效识别模型,基于多个词向量生成多个词向量各自对应的词向量识别结果之前,还包括:确定训练文本以及训练文本对应的文本识别结果;基于训练文本和训练文本对应的文本识别结果,确定训练文本对应的多个词向量样本以及多个词向量样本各自对应的词向量识别结果,建立初始网络模型,并基于多个词向量样本和多个词向量样本各自对应的词向量识别结果训练初始网络模型,以生成功效识别模型。
结合本申请的第一方面,在一些实施例中,训练文本包括第一语言训练文本和第二语言训练文本,第一语言训练文本包括使用第一语言撰写的文本内容,第二语言训练文本包括使用第二语言撰写的文本内容,第一语言训练文本对应的文本识别结果为第一文本识别结果,第二语言训练文本对应的文本识别结果为第二文本识别结果;确定训练文本以及训练文本对应的文本识别结果,包括:获取第一语言训练文本和第二语言训练文本;基于第一语言训练文本和第一文本识别结果,得到第一语言训练文本对应的第一功效标记文本;基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果。
结合本申请的第一方面,在一些实施例中,基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果,包括:翻译第一功效标记文本,得到第一功效标记文本对应的翻译文本,其中,翻译文本利用第二语言表达;采用相似度算法,基于第一功效标记文本对应的翻译文本,确定第二文本识别结果。
结合本申请的第一方面,在一些实施例中,训练文本为专利文本,获取第一语言训练文本和第二语言训练文本,包括:获取包括多种不同语言的专利家族文本数据;基于专利家族文本数据筛选第一语言训练文本和第二语言训练文本,其中,第一语言训练文本包括功效标识段落信息,第二语言训练文本包括待进行功效标记的专利文本。
结合本申请的第一方面,在一些实施例中,第一语言包括日语,第一语言训练文本包括日本专利文本。
第二方面,本申请一实施例提供的一种文本识别装置,包括:拆分模块,配置为基于待识别文本确定待识别文本对应的多个字符串,其中,多个字符串中相邻的字符串有部分重叠;转化模块,配置为对多个字符串进行词向量转化,得到多个词向量,其中,多个词向量与多个字符串呈一一对应关系;生成模块,配置为基于多个词向量生成多个词向量各自对应的词向量识别结果,其中,词向量识别结果为功效文本或非功效文本;以及确定模块,配置为基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果。
第三方面,本申请一实施例提供了一种计算机可读存储介质,存储介质存储有指令,当指令由电子设备的处理器执行时,使得电子设备能够执行上述任一实施例的文本识别方法。
第四方面,本申请一实施例提供了一种电子设备,电子设备包括:处理器;用于存储计算机可执行指令的存储器;处理器,用于执行计算机可执行指令,以实现上述任一实施例的文本识别方法。
本申请实施例提供的文本识别方法,通过基于待识别文本确定待识别文本对应的多 个字符串,且多个字符串中相邻的字符串有部分重叠,从而可以使多个字符串体现待识别文本的上下文之间的关系。然后,对多个字符串进行词向量转化,得到多个词向量,基于多个词向量生成多个词向量各自对应的词向量识别结果,以确定词向量对应的文本是功效文本还是非功效文本,并根据多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果,可以更加细致的进行文本识别,并且综合多个词向量识别结果来确定待识别文本的文本识别结果,识别到了待识别文本的上下文之间的关系,提高了文本识别的准确性。另外,本申请的文本识别方法不需要人工标引,提高了文本识别效率。
附图简要说明
图1所示为本申请一实施例提供的文本识别方法的应用场景示意图。
图2所示为本申请一实施例提供的一种文本识别方法的流程示意图。
图3所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图4a所示为本申请一实施例提供的一种字符串确定方法的示意图。
图4b所示为本申请另一实施例提供的一种字符串确定方法的示意图。
图5所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图6a所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图6b所示为本申请一实施例提供的一种待识别单元与字符串的对应关系的示意图。
图6c所示为本申请另一实施例提供的一种待识别单元与字符串的对应关系的示意图。
图7所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图8所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图9所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图10所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图11a所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图11b所示为本申请一实施例提供的一种第一功效标记文本的示意图。
图12所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图13所示为本申请另一实施例提供的一种文本识别方法的流程示意图。
图14所示为本申请一实施例提供的一种文本识别装置的结构示意图。
图15所示为本申请一实施例提供的一种字符串确定单元的结构示意图。
图16所示为本申请一实施例提供的一种确定模块的结构示意图。
图17所示为本申请一实施例提供的一种投票单元的结构示意图。
图18所示为本申请一实施例提供的一种投票操作子单元的结构示意图。
图19所示为本申请一实施例提供的一种生成模块的结构示意图。
图20所示为本申请另一实施例提供的一种文本识别装置的结构示意图。
图21所示为本申请一实施例提供的一种样本预处理模块的结构示意图。
图22所示为本申请一实施例提供的一种第二样本预处理单元的结构示意图。
图23所示为本申请另一实施例提供的一种样本预处理模块的结构示意图。
图24所示为本申请另一实施例提供的一种第一样本预处理单元的结构示意图。
图25所示为本申请一实施例提供的电子设备的结构示意图。
实施本发明的方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其 他实施例,都属于本申请保护的范围。
本申请提供的技术方案可以应用在智能终端(比如平板电脑、手机等)中,以使智能终端具备相关功能,比如,功效文本的识别功能、文本的翻译功能等。
示例性地,本申请提供的技术方案可以应用于专利检索场景中。具体地,在专利检索场景中,特别是在专利功效检索场景中,利用本申请提供的技术方案能够通过技术功效来判断专利文献所要表达的主题,能对专利进行相应的分类,以提高专利检索的效率。可选地,本申请利用深度学习网络模型识别技术功效段落,可以缩小寻找技术功效词的范围。可选地,本申请能够直接呈现专利的技术功效内容,可以为用户节省阅读专利文献的时间,帮助用户快速理解专利的主题和技术特征。
除了上述提及的专利检索场景,本申请还可以基于服务器的形式应用于诸多其他场景。下面结合图1进行简单的介绍。
图1所示为本申请一实施例提供的文本识别方法的应用场景示意图。图1所示的场景包括服务器110以及与服务器110通信连接的客户端120。具体而言,服务器110用于基于待识别文本确定多个词向量,多个词向量用于表征待识别文本的语义语法信息;基于多个词向量生成多个词向量各自对应的词向量识别结果,词向量识别结果为功效文本或非功效文本;以及基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果。
示例性地,在实际应用过程中,客户端120可以接收用户输入的待识别文本,并将接收的待识别文本发送至服务器110,服务器110基于接收的待识别文本生成文本识别结果,并将生成的文本识别结果发送至客户端120,客户端120将接收的文本识别结果呈现给用户。
示例性方法
下面结合图2至图13对本申请提供的文本识别方法进行简单的介绍。
图2所示为本申请一实施例提供的一种文本识别方法的流程示意图。如图2所示,本申请实施例提供的文本识别方法包括如下步骤。
步骤210,基于待识别文本确定待识别文本对应的多个字符串。
示例性地,多个字符串中相邻的字符串有部分重叠。例如,待识别文本为:“本实用新型涉及农业机械技术领域,具体涉及一种履带式收割机脱粒部防护罩整体配置的结构。包括脱粒机架,还包括整体防护罩”。待识别文本对应的多个字符串可以为:“Span1:本实用新型涉及农业机械技术领域”,“Span2:械技术领域,具体涉及一种履带式收割机脱粒部防护”,“Span3:脱粒部防护罩整体配置的结构。包括脱粒机架”,“Span4:结构。包括脱粒机架,还包括整体防护罩”。“Span”表示字符串,“Span1”表示第一个字符串,“Span2”表示第二个字符串,以此类推。每个字符串,可以包括相同数量的字符,也可以包括不同数量的字符,本申请不做具体限定。相邻的字符串可以包括部分相同的字符,例如,第一个字符串和第二个字符串都包括如下字符“械技术领域”。
待识别文本可以是一个句子,可以是一个段落,也可以是一篇文本,本申请不做具体限定。
步骤220,对多个字符串进行词向量转化,得到多个词向量。
示例性地,多个词向量与多个字符串呈一一对应关系。对多个字符串进行词向量转化可以使用Word2vec(word to vector)进行转化。Word2vec是用来产生词向量的模型。将字符串输入Word2vec,即可得到对应的词向量。
步骤230,基于多个词向量生成多个词向量各自对应的词向量识别结果。
示例性地,词向量识别结果为功效文本或非功效文本。每一个词向量对应一个词向量识别结果。功效文本是表征技术效果的文本,非功效文本即除去表征技术效果的文本之外的其他文本。
在本申请一实施例中,待识别文本可以是专利文本,功效文本是专利文本中表征技 术效果的文本,非功效文本是专利文本中除去表征技术效果的文本之外的其他文本。
步骤240,基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果。
具体地,如果待识别文本是一个句子,待识别文本的文本识别结果可以是该句子是功效文本或非功效文本。如果待识别文本是一个段落,待识别文本的文本识别结果可以是该段落是功效文本或非功效文本。如果待识别文本是一篇文本,待识别文本的文本识别结果可以是该文本中的一个句子或一个段落被标注为功效文本。
通过确定待识别文本对应的多个字符串,并对多个字符串进行词向量转化,得到多个词向量,可以使用多个词向量来综合表征待识别文本,为文本识别提供准确的数据支持。
本申请实施例提供的文本识别方法,通过基于待识别文本确定待识别文本对应的多个字符串,且多个字符串中相邻的字符串有部分重叠,从而可以使多个字符串体现待识别文本的上下文之间的关系。然后,对多个字符串进行词向量转化,得到多个词向量,基于多个词向量生成多个词向量各自对应的词向量识别结果,以确定词向量对应的文本是功效文本还是非功效文本,并根据多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果,可以更加细致的进行文本识别,并且综合多个词向量识别结果来确定待识别文本的文本识别结果,识别到了待识别文本的上下文之间的关系,提高了文本识别的准确性。另外,本申请的文本识别方法不需要人工标引,提高了文本识别效率。
图3所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图2所示实施例的基础上延伸出本申请图3所示实施例,下面着重叙述图3所示实施例与图2所示实施例的不同之处,相同之处不再赘述。如图3所示,基于待识别文本确定待识别文本对应的多个字符串的步骤,包括如下步骤。
步骤310,基于待识别文本确定预设字符串长度的第1个字符串。
示例性地,多个字符串的数量为M,M为大于或等于1的正整数。预设字符串长度可以是预先设定的字符串的长度,例如,可以是256个字符,也可以是128个字符,预设字符串长度可以根据实际需求进行选择,本申请不做具体限定。
步骤320,以第N个字符串中的字符为第N+1个字符串的起点字符,基于预设字符串长度和预设分解步长分解待识别文本,得到第N+1个字符串。
示例性地,N为大于或等于1,且小于M的正整数。预设分解步长可以是128个字符,也可以是64个字符,预设分解步长可以根据实际需求进行选择,本申请不做具体限定。
示例性地,可以使用滑动窗口的方法来确定多个字符串。如图4a所示,待识别文本为:“本实用新型涉及农业机械技术领域,具体涉及一种履带式收割机脱粒部防护罩整体配置的结构。包括脱粒机架,还包括整体防护罩。”。滑动窗口的预设字符串长度可以是20个字符,滑动窗口的预设分解步长可以是10个字符。待识别文本对应的多个字符串为:“Span1:本实用新型涉及农业机”,“Span2:涉及农业机械技术领域”,“Span3:械技术领域,具体涉及”,“Span4:,具体涉及一种履带式”,“Span5:一种履带式收割机脱粒”,“Span6:收割机脱粒部防护罩整”,“Span7:部防护罩整体配置的结”,“Span8:体配置的结构。包括脱”,“Span9:构。包括脱粒机架,还”,“Span10:粒机架,还包括整体防”,“Span11:包括整体防护罩。Padding Padding”。即,图4a中,字符串的数量M为11个。如果N为1,则第1个字符串为Span1,第N+1个字符串为Span2,即Span1中的某个字符为Span2的起点字符。“Padding”为一个自动占位符,如果最后一个字符串的长度小于预设字符串长度时,可以使用自动占位符补充字符串的长度,使最后一个字符串的长度等于预设字符串长度。如图4b所示,图4b中的一个矩形表示一个字符串,打有剖面线的矩形表示相邻字符串的重叠部分。
通过先确定预设字符串长度的第1个字符串,再以第N个字符串中的字符为第N+1 个字符串的起点字符,基于预设字符串长度和预设分解步长分解待识别文本,得到第N+1个字符串,可以使每个字符串包括同样数量的字符,且相邻字符串包括数量相同的重合字符,提高了字符串分解的均匀性,更好的表征了字符串之间的语义语法关系,从而为文本识别提供更加准确的数据支持。
图5所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图2所示实施例的基础上延伸出本申请图5所示实施例,下面着重叙述图5所示实施例与图2所示实施例的不同之处,相同之处不再赘述。如图5所示,基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果的步骤,包括如下步骤。
步骤510,采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果。
示例性地,针对一个待识别文本,如果在待识别文本对应的词向量识别结果中,功效文本的数量大于或等于非功效文本的数量,确定待识别文本的文本识别结果为功效文本。如果在待识别文本对应的词向量识别结果中,功效文本的数量小于非功效文本的数量,确定待识别单元的文本识别结果为非功效文本。
由于多个字符串中相邻的字符串有部分重叠,因此通过采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果,能够识别到待识别文本的上下文之间的关系,提高了文本识别的准确性,同时投票机制符合少数服从多数的投票原则,进一步提高了文本识别的准确性。
图6a所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图5所示实施例的基础上延伸出本申请图6a所示实施例,下面着重叙述图6a所示实施例与图5所示实施例的不同之处,相同之处不再赘述。如图6a所示,采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果的步骤,包括如下步骤。
步骤610,基于待识别文本确定待识别文本对应的多个待识别单元。
示例性地，待识别单元对应至少一个词向量。如果待识别文本是一篇文本，则待识别单元可以是该篇文本中的一个句子或者一个段落。字符串可以是对该句子或者段落进行分解得到的文本片段。词向量与字符串一一对应。
在本申请一实施例中，待识别文本为专利文本，待识别单元可以是专利文本中的句子、段落和文本模块中的至少一种。文本模块可以是摘要模块、权利要求书模块和说明书模块中的至少一种。
步骤620,针对多个待识别单元中的每个待识别单元,采用投票机制,对待识别单元对应的词向量识别结果进行投票操作,确定待识别单元的文本识别结果。
示例性地,待识别单元的文本识别结果可以是功效文本或非功效文本。
具体地,如图6b和6c所示,T为一个待识别单元。一个待识别单元与字符串的对应情况可以有以下四种。第一种,如图6b所示,一个字符串包含一个待识别单元。第二种,一个字符串与一个待识别单元没有对应关系,如图6c所示,Span1和Span7与T的关系即属于第二种情况。第三种,一个字符串与一个待识别单元存在重叠的部分或一个待识别单元包含至少一个字符串,如图6c所示,Span2至Span6与T的关系即属于第三种情况。对于第一种情况,该待识别单元的文本识别结果取决于该字符串对应的词向量识别结果,即,如果该字符串对应的词向量识别结果是功效文本,则该待识别单元的文本识别结果即为功效文本,如果该字符串对应的词向量识别结果是非功效文本,则该待识别单元的文本识别结果即为非功效文本。对于第二种情况,该待识别单元的文本识别结果与该字符串对应的词向量识别结果不存在关联关系。对于第三种情况,该待识别单元的文本识别结果取决于与一个待识别单元存在重叠的部分和待识别单元包含的字符串对应的词向量识别结果。即,待识别单元T的文本识别结果取决于Span2至Span6的词向量识别结果。第三种情况,需要采用投票机制,对待识别单元对应的词向量识别 结果进行投票操作,确定待识别单元的文本识别结果。
步骤630,基于多个待识别单元各自对应的文本识别结果,确定待识别文本对应的文本识别结果。
示例性地,待识别文本对应的文本识别结果可以是部分句子或段落标注为功效文本。
通过确定待识别单元的文本识别结果,并基于多个待识别单元各自对应的文本识别结果,确定待识别文本对应的文本识别结果,可以识别出一篇文本中的多个部分是否是功效文本,从而对属于功效文本的部分进行标注,方便用户查看。
图7所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图6所示实施例的基础上延伸出本申请图7所示实施例,下面着重叙述图7所示实施例与图6所示实施例的不同之处,相同之处不再赘述。如图7所示,针对多个待识别单元中的每个待识别单元,采用投票机制,对待识别单元对应的词向量识别结果进行投票操作,确定待识别单元的文本识别结果的步骤,包括如下步骤。
步骤710,针对多个待识别单元中的每个待识别单元,如果在待识别单元对应的词向量识别结果中,功效文本的数量大于或等于非功效文本的数量,确定待识别单元的文本识别结果为功效文本。
步骤720,针对多个待识别单元中的每个待识别单元,如果在待识别单元对应的词向量识别结果中,功效文本的数量小于非功效文本的数量,确定待识别单元的文本识别结果为非功效文本。
具体地,针对图6c所示的第三种情况,待识别单元的文本识别结果即为Span2至Span6的词向量识别结果中功效文本的数量和非功效文本的数量的对比结果。例如,Span2的词向量识别结果为功效文本,Span3的词向量识别结果为功效文本,Span4的词向量识别结果为功效文本,Span5的词向量识别结果为非功效文本,Span6的词向量识别结果为非功效文本,即功效文本的数量有3个,非功效文本的数量有2个,则功效文本的数量大于非功效文本的数量,则待识别单元的文本识别结果为功效文本。
针对多个待识别单元中的每个待识别单元,采用少数服从多数的投票原则确定待识别单元的文本识别结果,进一步提高了文本识别的准确性。
图8所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图2所示实施例的基础上延伸出本申请图8所示实施例,下面着重叙述图8所示实施例与图2所示实施例的不同之处,相同之处不再赘述。如图8所示,基于多个词向量生成多个词向量各自对应的词向量识别结果的步骤,包括如下步骤。
步骤810,利用功效识别模型,基于多个词向量生成多个词向量各自对应的词向量识别结果。
示例性地,功效识别模型用于基于输入的词向量生成输入的词向量对应的词向量识别结果。
通过利用功效识别模型生成多个词向量各自对应的词向量识别结果,不需要进行人工标引,能够自动得到词向量识别结果,降低了人力成本。
图9所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图8所示实施例的基础上延伸出本申请图9所示实施例,下面着重叙述图9所示实施例与图8所示实施例的不同之处,相同之处不再赘述。如图9所示,在利用功效识别模型,基于多个词向量生成多个词向量各自对应的词向量识别结果的步骤之前,还包括如下步骤。
步骤910,确定训练文本以及训练文本对应的文本识别结果。
具体地,步骤910中提及的训练文本与上述实施例中的待识别文本对应。
步骤920,基于训练文本和训练文本对应的文本识别结果,确定训练文本对应的多个词向量样本以及多个词向量样本各自对应的词向量识别结果。
具体地,确定训练文本对应的多个词向量样本的方法参见上述实施例中确定待识别文本对应的多个词向量的方法,在此不再赘述。确定文本识别结果对应的多个词向量识别结果可以是将文本识别结果中的功效文本对应的词向量的词向量识别结果标注为功效文本,并将文本识别结果中的非功效文本对应的词向量的词向量识别结果标注为非功效文本。
步骤930,建立初始网络模型,并基于多个词向量样本和多个词向量样本各自对应的词向量识别结果训练初始网络模型,以生成功效识别模型。
步骤930提及的功效识别模型用于基于输入的词向量生成输入的词向量对应的词向量识别结果。初始网络模型可以是BERT模型。BERT是开源的预训练语言模型,预训练语言模型是在广泛的数据集上训练得到的,因此BERT模型是一个已有一定语言基础先验知识的模型,利用BERT模型作为初始网络模型进行训练,在训练过程中,BERT模型不断调整模型框架中的参数,通过不断迭代调整,使BERT模型达到最优效果,提高了初始网络模型的学习效率,同时提高了功效识别模型的准确性。
图10所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图9所示实施例的基础上延伸出本申请图10所示实施例,下面着重叙述图10所示实施例与图9所示实施例的不同之处,相同之处不再赘述。如图10所示,确定训练文本以及训练文本对应的文本识别结果的步骤,包括如下步骤。
步骤1010,获取第一语言训练文本和第二语言训练文本。
步骤1020,基于第一语言训练文本和第一文本识别结果,得到第一语言训练文本对应的第一功效标记文本。
示例性地,第一功效标记文本可以是功效标记文本。
示例性地,训练文本包括第一语言训练文本和第二语言训练文本,第一语言训练文本和第二语言训练文本包括分别使用不同语言撰写的等同的内容,第一语言训练文本包括使用第一语言撰写的文本内容,第二语言训练文本包括使用第二语言撰写的文本内容,第一语言训练文本对应的文本识别结果为第一文本识别结果,第二语言训练文本对应的文本识别结果为第二文本识别结果。文本识别结果包括第一文本识别结果和第二文本识别结果。
在本申请一实施例中,训练文本为专利文本,第一语言训练文本和第二语言训练文本属于同一个专利家族。在本申请一实施例中,第一语言包括日语,第一语言训练文本包括日本专利文本。
具体地,日本专利文本多数被标注有“发明效果”的标记,例如,标记了“発明の効果”。在实际应用中,可以先筛选出日本专利文本,然后再筛选出标记有“发明效果”的日本专利文本。第一功效标记文本可以是标记了“发明效果”的效果部分的文本。
步骤1030,基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果。
通过确定第一语言训练文本和第二语言训练文本,并基于第一语言训练文本的第一功效标记文本,确定第二语言训练文本的第二文本识别结果,可以使原本没有第二文本识别结果的第二语言训练文本,快速的确定了第二文本识别结果,提高了训练文本和训练文本对应的文本识别结果的确定效率。
图11a所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图10所示实施例的基础上延伸出本申请图11a所示实施例,下面着重叙述图11a所示实施例与图10所示实施例的不同之处,相同之处不再赘述。如图11a所示,基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果的步骤,包括如下步骤。
步骤1110,翻译第一功效标记文本,得到第一功效标记文本对应的翻译文本。
示例性地,翻译文本利用第二语言表达。第一语言可以是日语。如图11b所示,第一语言的第一功效标记文本可以是标记了“発明の効果”的效果部分的文本。第二语言 可以是汉语,可以是英语,也可以是其他语言,本申请不做具体限定。
步骤1120,采用相似度算法,基于第一功效标记文本对应的翻译文本,确定第二文本识别结果。
通过计算第二语言的第一功效标记文本与第二语言训练文本的相似度,可以得到第二语言训练文本中与第二语言的第一功效标记文本相似度最高的句子或段落,即与第二语言的第一功效标记文本相似度最高的句子或段落为第二语言训练文本的第二文本识别结果,从而快速的确定了第二文本识别结果,提高了训练文本和训练文本对应的文本识别结果的确定效率。
在本申请一实施例中,翻译第一功效标记文本,得到第一功效标记文本对应的翻译文本的步骤,包括如下步骤。
将第一功效标记文本输入文本翻译模型，以生成第一功效标记文本对应的翻译文本。
通过使用文本翻译模型将第一语言的第一功效标记文本翻译成第二语言的第一功效标记文本,进一步提高了文本识别的效率。
图12所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图10所示实施例的基础上延伸出本申请图12所示实施例,下面着重叙述图12所示实施例与图10所示实施例的不同之处,相同之处不再赘述。如图12所示,获取第一语言训练文本和第二语言训练文本的步骤包括以下步骤。
步骤1210,获取包括多种不同语言的专利家族文本数据。
示例性地,专利家族文本数据可以是简单家族文本数据。
步骤1220,基于专利家族文本数据筛选第一语言训练文本和第二语言训练文本。
示例性地,第一语言训练文本包括功效标识段落信息。第二语言训练文本包括待进行功效标记的专利文本。
在实际应用中,以第一语言训练文本为日本专利文本,第二语言训练文本为中文专利文本为例,可以通过以下步骤得到训练文本:
(1)可以先从专利数据库中筛选出多组家族专利,每组家族专利中包括日语专利和中文专利。
(2)筛选出所有家族专利中的日语专利,再从所有的日语专利中筛选出标记有“发明效果”的日本专利文本作为。
(3)将标记有“发明效果”的部分翻译为中文文本。
(4)将翻译出的中文文本与整篇中文专利进行相似度计算,得到整篇中文专利中与翻译出的中文文本相似度较高的部分作为文本识别结果。
图13所示为本申请另一实施例提供的一种文本识别方法的流程示意图。在本申请图10所示实施例的基础上延伸出本申请图13所示实施例,下面着重叙述图13所示实施例与图10所示实施例的不同之处,相同之处不再赘述。如图13所示,在基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果的步骤之后,还包括如下步骤。
步骤1310,确定第二语言训练文本对应的第二功效标记文本。
具体地,第二语言训练文本也可以包括第二功效标记文本,例如,部分中文专利也可以包括标记了“发明效果”的效果部分的文本。第二功效标记文本可以是标记了“发明效果”的效果部分的文本,即功效标记文本。
步骤1320,基于第二文本识别结果与第二功效标记文本,确定第二文本识别结果的准确度。
具体地,准确度可以是精确度A,也可以是召回率R。
精确度的计算公式如下:
Figure PCTCN2022107580-appb-000001
召回率的计算公式如下:
Figure PCTCN2022107580-appb-000002
其中,W 1为所有训练文本中的第二语言训练文本的第二文本识别结果中功效段落或功效句子的数量,W 2为所有训练文本中的第二语言训练文本的第二功效标记文本中功效段落或功效句子的数量。
本申请通过对比试验得出:以本申请的方法得到的精确度和召回率都大于90%。另外,本申请的方法只需要十几万的训练文本,即可达到满意的效果。
步骤1330,根据第二文本识别结果的准确度调整训练文本。
具体地,如果在确定训练文本时,发现准确度不高,例如,精确度和召回率都小于80%,则可以调整训练文本,例如可以重新选择训练文本。
通过计算第二文本识别结果的准确度来调整训练文本,提高了训练文本的质量,从而提高了功效识别模型的质量,进一步提高了文本识别的准确性。
示例性装置
上文结合图2至图13,详细描述了本申请的方法实施例,下面结合图14至图24,详细描述本申请的装置实施例。方法实施例的描述与装置实施例的描述相互对应,因此,未详细描述的部分可以参见前面的方法实施例。
图14所示为本申请一实施例提供的一种文本识别装置的结构示意图。如图14所示,本申请实施例提供的文本识别装置1400包括:
拆分模块1410,配置为基于待识别文本确定待识别文本对应的多个字符串,其中,多个字符串中相邻的字符串有部分重叠;
转化模块1420,配置为对多个字符串进行词向量转化,得到多个词向量,其中,多个词向量与多个字符串呈一一对应关系;
生成模块1430,配置为基于多个词向量生成多个词向量各自对应的词向量识别结果,其中,词向量识别结果为功效文本或非功效文本;
确定模块1440,配置为基于多个词向量各自对应的词向量识别结果确定待识别文本的文本识别结果。
图15所示为本申请一实施例提供的一种字符串确定单元的结构示意图。在本申请图14所示实施例基础上延伸出本申请图15所示实施例,下面着重叙述图15所示实施例与图14所示实施例的不同之处,相同之处不再赘述。
如图15所示,在本申请实施例提供的文本识别装置1400中,拆分模块1410包括:
第一个字符串确定单元1511,配置为基于待识别文本确定预设字符串长度的第1个字符串;
第N个字符串确定单元1512,配置为以第N个字符串中的字符为第N+1个字符串的起点字符,基于预设字符串长度和预设分解步长分解待识别文本,得到第N+1个字符串,其中,N为大于或等于1,且小于M的正整数。
图16所示为本申请一实施例提供的一种确定模块的结构示意图。在本申请图14所示实施例基础上延伸出本申请图16所示实施例,下面着重叙述图16所示实施例与图14所示实施例的不同之处,相同之处不再赘述。
如图16所示,在本申请实施例提供的文本识别装置1400中,确定模块1440包括:
投票确定单元1441,配置为采用投票机制,对多个词向量各自对应的词向量识别结果进行投票操作,确定待识别文本的文本识别结果。
图17所示为本申请一实施例提供的一种投票单元的结构示意图。在本申请图16 所示实施例基础上延伸出本申请图17所示实施例,下面着重叙述图17所示实施例与图16所示实施例的不同之处,相同之处不再赘述。
如图17所示,在本申请实施例提供的文本识别装置1400中,投票确定单元1441包括:
投票前处理子单元1711,配置为基于待识别文本确定待识别文本对应的多个待识别单元,其中,待识别单元对应至少一个词向量;
投票子单元1712,配置为针对多个待识别单元中的每个待识别单元,采用投票机制,对待识别单元对应的词向量识别结果进行投票操作,确定待识别单元的文本识别结果;
投票后处理子单元1713,配置为基于多个待识别单元各自对应的文本识别结果,确定待识别文本对应的文本识别结果。
图18所示为本申请一实施例提供的一种投票操作子单元的结构示意图。在本申请图17所示实施例基础上延伸出本申请图18所示实施例,下面着重叙述图18所示实施例与图17所示实施例的不同之处,相同之处不再赘述。
如图18所示,在本申请实施例提供的文本识别装置1400中,投票子单元1712包括:
投票操作子单元1811,配置为针对多个待识别单元中的每个待识别单元,如果在待识别单元对应的词向量识别结果中,功效文本的数量大于或等于非功效文本的数量,确定待识别单元的文本识别结果为功效文本。
图19所示为本申请一实施例提供的一种生成模块的结构示意图。在本申请图14所示实施例基础上延伸出本申请图19所示实施例,下面着重叙述图19所示实施例与图14所示实施例的不同之处,相同之处不再赘述。
如图19所示,在本申请实施例提供的文本识别装置1400中,生成模块1430包括:
生成单元1431,配置为利用功效识别模型,基于多个词向量生成多个词向量各自对应的词向量识别结果,其中,功效识别模型用于基于输入的词向量生成输入的词向量对应的词向量识别结果。
图20所示为本申请另一实施例提供的一种文本识别装置的结构示意图。在本申请图19所示实施例基础上延伸出本申请图20所示实施例,下面着重叙述图20所示实施例与图19所示实施例的不同之处,相同之处不再赘述。
如图20所示,在本申请实施例提供的文本识别装置1400还包括:
样本预处理模块1450,配置为确定训练文本以及训练文本对应的文本识别结果;
样本确定模块1460,配置为基于训练文本和训练文本对应的文本识别结果,确定训练文本对应的多个词向量样本以及多个词向量样本各自对应的词向量识别结果;
模型确定模块1470,配置为建立初始网络模型,并基于多个词向量样本和多个词向量样本各自对应的词向量识别结果训练初始网络模型,以生成功效识别模型。
图21所示为本申请一实施例提供的一种样本预处理模块的结构示意图。在本申请图20所示实施例基础上延伸出本申请图21所示实施例,下面着重叙述图21所示实施例与图20所示实施例的不同之处,相同之处不再赘述。
如图21所示,在本申请实施例提供的文本识别装置1400中,样本预处理模块1450包括:
第一样本预处理单元1451,配置为获取第一语言训练文本和第二语言训练文本;
第二样本预处理单元1452,基于第一语言训练文本和第一文本识别结果,得到第一语言训练文本对应的第一功效标记文本;
第三样本预处理单元1453,配置为基于第一功效标记文本,确定第二语言训练文本对应的第二文本识别结果。
图22所示为本申请一实施例提供的一种第二样本预处理单元的结构示意图。在本 申请图21所示实施例基础上延伸出本申请图22所示实施例,下面着重叙述图22所示实施例与图21所示实施例的不同之处,相同之处不再赘述。
如图22所示,在本申请实施例提供的文本识别装置1400中,第二样本预处理单元1452包括:
翻译子单元2210,配置为翻译第一功效标记文本,得到第一功效标记文本对应的翻译文本,其中,翻译文本利用第二语言表达;
相似度确定子单元2220,配置为采用相似度算法,基于第一功效标记文本对应的翻译文本,确定第二文本识别结果。
翻译子单元2210,进一步配置为将第一功效标记文本输入文本翻译模型,以生成第一功效标记文本对应的翻译文本。
图23所示为本申请另一实施例提供的一种样本预处理模块的结构示意图。在本申请图21所示实施例基础上延伸出本申请图23所示实施例,下面着重叙述图23所示实施例与图21所示实施例的不同之处,相同之处不再赘述。
如图23所示,在本申请实施例提供的文本识别装置1400中,样本预处理模块1450还包括:
第四样本预处理单元1454,配置为确定第二语言训练文本对应的第二功效标记文本;
对比单元1455,配置为基于第二文本识别结果与第二功效标记文本,确定第二文本识别结果的准确度;
调整单元1456,配置为根据第二文本识别结果的准确度调整训练文本。
图24所示为本申请另一实施例提供的一种第一样本预处理单元的结构示意图。在本申请图21所示实施例基础上延伸出本申请图24所示实施例,下面着重叙述图24所示实施例与图21所示实施例的不同之处,相同之处不再赘述。
如图24所示,在本申请实施例提供的文本识别装置1400中,第一样本预处理单元1451还包括:
样本获取子单元2410,配置为获取包括多种不同语言的专利家族文本数据。
筛选子单元2420,配置为基于专利家族文本数据筛选第一语言训练文本和第二语言训练文本。
示例性电子设备
图25所示为本申请一实施例提供的电子设备的结构示意图。如图25所示,该电子设备250包括:一个或多个处理器2501和存储器2502;以及存储在存储器2502中的计算机程序指令,计算机程序指令在被处理器2501运行时使得处理器2501执行如上述任一实施例的文本识别方法。
处理器2501可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备中的其他组件以执行期望的功能。
存储器2502可以包括一个或多个计算机程序产品,计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。在计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器2501可以运行程序指令,以实现上文的本申请的各个实施例的文本识别方法中的步骤以及/或者其他期望的功能。
在一个示例中,电子设备250还可以包括:输入装置2503和输出装置2504,这些组件通过总线系统和/或其他形式的连接机构(图25中未示出)互连。
此外,该输入装置2503还可以包括例如键盘、鼠标、麦克风等等。
该输出装置2504可以向外部输出各种信息,例如可以包括例如显示器、扬声器、打印机、以及通信网络及其所连接的远程输出设备等等。
当然,为了简化,图25中仅示出了该电子设备250中与本申请有关的组件中的一些,省略了诸如总线、输入装置/输出接口等组件。除此之外,根据具体应用情况,电子设备250还可以包括任何其他适当的组件。
示例性计算机可读存储介质
除了上述方法和设备以外,本申请的实施例还可以是计算机程序产品,包括计算机程序指令,计算机程序指令在被处理器运行时使得处理器执行如上述任一实施例的文本识别方法中的步骤。
计算机程序产品可以以一种或多种程序设计语言的任意组合来编写用于执行本申请实施例操作的程序代码,程序设计语言包括面向对象的程序设计语言,诸如Java、C++等,还包括常规的过程式程序设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。
此外,本申请的实施例还可以是计算机可读存储介质,其上存储有计算机程序指令,计算机程序指令在被处理器运行时使得处理器执行本说明书上述“示例性方法”部分中描述的根据本申请各种实施例的文本识别方法中的步骤。
计算机可读存储介质可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以包括但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器((RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
以上结合具体实施例描述了本申请的基本原理,但是,需要指出的是,在本申请中提及的优点、优势、效果等仅是示例而非限制,不能认为这些优点、优势、效果等是本申请的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本申请为必须采用上述具体的细节来实现。
本申请中涉及的器件、装置、设备、系统的方框图仅作为例示性的例子并且不意图要求或暗示必须按照方框图示出的方式进行连接、布置、配置。如本领域技术人员将认识到的,可以按任意方式连接、布置、配置这些器件、装置、设备、系统。诸如“包括”、“包含”、“具有”等等的词语是开放性词汇,指“包括但不限于”,且可与其互换使用。这里所使用的词汇“或”和“和”指词汇“和/或”,且可与其互换使用,除非上下文明确指示不是如此。这里所使用的词汇“诸如”指词组“诸如但不限于”,且可与其互换使用。
还需要指出的是,在本申请的装置、设备和方法中,各部件或各步骤是可以分解和/或重新组合的。这些分解和/或重新组合应视为本申请的等效方案。
提供所公开的方面的以上描述以使本领域的任何技术人员能够做出或者使用本申请。对这些方面的各种修改对于本领域技术人员而言是非常显而易见的,并且在此定义的一般原理可以应用于其他方面而不脱离本申请的范围。因此,本申请不意图被限制到在此示出的方面,而是按照与在此公开的原理和新颖的特征一致的最宽范围。
为了例示和描述的目的已经给出了以上描述。此外,此描述不意图将本申请的实施例限制到在此公开的形式。尽管以上已经讨论了多个示例方面和实施例,但是本领域技术人员将认识到其某些变型、修改、改变、添加和子组合。
以上所述仅为本申请的较佳实施例而已,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换等,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种文本识别方法,包括:
    基于待识别文本确定所述待识别文本对应的多个字符串,其中,所述多个字符串中相邻的所述字符串有部分重叠;
    对所述多个字符串进行词向量转化,得到多个词向量,其中,所述多个词向量与所述多个字符串呈一一对应关系;
    基于所述多个词向量生成所述多个词向量各自对应的词向量识别结果,其中,所述词向量识别结果为功效文本或非功效文本;以及
    基于所述多个词向量各自对应的词向量识别结果确定所述待识别文本的文本识别结果。
  2. 根据权利要求1所述的文本识别方法,其中,所述多个字符串的数量为M,M为大于1的正整数,所述基于所述待识别文本确定所述待识别文本对应的多个字符串,包括:
    基于所述待识别文本确定预设字符串长度的第1个字符串;
    以第N个字符串中的字符为第N+1个字符串的起点字符,基于所述预设字符串长度和预设分解步长分解所述待识别文本,得到所述第N+1个字符串,其中,N为大于或等于1,且小于M的正整数。
  3. 根据权利要求1所述的文本识别方法,其中,所述基于所述多个词向量各自对应的词向量识别结果确定所述待识别文本的文本识别结果,包括:
    采用投票机制,对所述多个词向量各自对应的词向量识别结果进行投票操作,确定所述待识别文本的所述文本识别结果。
  4. 根据权利要求3所述的文本识别方法,其中,所述采用投票机制,对所述多个词向量各自对应的所述词向量识别结果进行投票操作,确定所述待识别文本的所述文本识别结果,包括:
    基于所述待识别文本确定所述待识别文本对应的多个待识别单元,其中,所述待识别单元对应至少一个所述词向量;
    针对所述多个待识别单元中的每个待识别单元,采用所述投票机制,对所述待识别单元对应的词向量识别结果进行所述投票操作,确定所述待识别单元的所述文本识别结果;
    基于所述多个待识别单元各自对应的文本识别结果,确定所述待识别文本对应的文本识别结果。
  5. 根据权利要求4所述的文本识别方法,其中,所述待识别文本为专利文本,所述待识别单元包括所述专利文本中的句子、段落和文本模块中的至少一种;其中,所述文本模块包括摘要模块、权利要求书模块和说明书模块中的至少一种。
  6. 根据权利要求4所述的文本识别方法,其中,所述采用所述投票机制,对所述待识别单元对应的词向量识别结果进行所述投票操作,确定所述待识别单元的所述文本识别结果,包括:
    如果在所述待识别单元对应的词向量识别结果中,所述功效文本的数量大于或等于所述非功效文本的数量,确定所述待识别单元的所述文本识别结果为所述功效文本。
  7. 根据权利要求4所述的文本识别方法,其中,所述采用所述投票机制,对所述待识别单元对应的词向量识别结果进行所述投票操作,确定所述待识别单元的所述文本识别结果,包括:
    如果在所述待识别文本对应的词向量识别结果中,所述功效文本的数量小于所述非功效文本的数量,确定所述待识别单元的所述文本识别结果为所述非功效文本。
  8. 根据权利要求1至7任一项所述的文本识别方法,其中,所述基于所述多个词向量生成所述多个词向量各自对应的词向量识别结果,包括:
    利用功效识别模型,基于所述多个词向量生成所述多个词向量各自对应的词向量识别 结果,其中,所述功效识别模型用于基于输入的词向量生成所述输入的词向量对应的词向量识别结果。
  9. 根据权利要求8所述的文本识别方法,其中,在所述利用功效识别模型,基于所述多个词向量生成所述多个词向量各自对应的词向量识别结果之前,还包括:
    确定训练文本以及所述训练文本对应的文本识别结果;
    基于所述训练文本和所述训练文本对应的文本识别结果,确定所述训练文本对应的多个词向量样本以及所述多个词向量样本各自对应的词向量识别结果;
    建立初始网络模型,并基于所述多个词向量样本和所述多个词向量样本各自对应的词向量识别结果训练所述初始网络模型,以生成所述功效识别模型。
  10. 根据权利要求8所述的文本识别方法,其中,所述初始网络模型包括:BERT模型。
  11. 根据权利要求9所述的文本识别方法,其中,所述训练文本包括第一语言训练文本和第二语言训练文本,所述第一语言训练文本包括使用第一语言撰写的文本内容,所述第二语言训练文本包括使用第二语言撰写的所述文本内容,所述第一语言训练文本对应的文本识别结果为第一文本识别结果,所述第二语言训练文本对应的文本识别结果为第二文本识别结果;
    所述确定训练文本以及训练文本对应的所述文本识别结果,包括:
    获取所述第一语言训练文本和所述第二语言训练文本;
    基于所述第一语言训练文本和所述第一文本识别结果,得到所述第一语言训练文本对应的第一功效标记文本;
    基于所述第一功效标记文本,确定所述第二语言训练文本对应的所述第二文本识别结果。
  12. 根据权利要求11所述的文本识别方法,其中,所述基于所述第一功效标记文本,确定所述第二语言训练文本对应的所述第二文本识别结果,包括:
    翻译所述第一功效标记文本,得到所述第一功效标记文本对应的翻译文本,其中,所述翻译文本利用所述第二语言表达;
    采用相似度算法,基于所述第一功效标记文本对应的翻译文本,确定所述第二文本识别结果。
  13. 根据权利要求12所述的文本识别方法,其中,所述翻译所述第一功效标记文本,得到所述第一功效标记文本对应的翻译文本,包括:
    将所述第一功效标记文本输入文本翻译模型，以生成所述第一功效标记文本对应的翻译文本。
  14. 根据权利要求9至12任一项所述的文本识别方法,其中,所述训练文本为专利文本,获取所述第一语言训练文本和所述第二语言训练文本,包括:
    获取包括多种不同语言的专利家族文本数据;
    基于所述专利家族文本数据筛选所述第一语言训练文本和所述第二语言训练文本,其中,所述第一语言训练文本包括功效标识段落信息,所述第二语言训练文本包括待进行功效标记的专利文本。
  15. 根据权利要求14所述的文本识别方法,其中,所述第一语言包括日语,所述第一语言训练文本包括日本专利文本。
  16. 根据权利要求14所述的文本识别方法,其中,所述第一语言训练文本和所述第二语言训练文本属于同一个专利家族。
  17. 根据权利要求14所述的文本识别方法,其中,所述第一语言训练文本包括日本专利文本,所述第二语言训练文本为中文专利文本。
  18. 一种文本识别装置,包括:
    拆分模块,配置为基于待识别文本确定所述待识别文本对应的多个字符串,其中,所述多个字符串中相邻的所述字符串有部分重叠;
    转化模块,配置为对所述多个字符串进行词向量转化,得到多个词向量,其中,所述多个词向量与所述多个字符串呈一一对应关系;
    生成模块,配置为基于所述多个词向量生成所述多个词向量各自对应的词向量识别结果,其中,所述词向量识别结果为功效文本或非功效文本;以及
    确定模块,配置为基于所述多个词向量各自对应的词向量识别结果确定所述待识别文本的文本识别结果。
  19. 一种计算机可读存储介质,所述存储介质存储有指令,当所述指令由电子设备的处理器执行时,使得所述电子设备能够执行上述权利要求1至17任一项所述的文本识别方法。
  20. 一种电子设备,所述电子设备包括:
    处理器;
    用于存储计算机可执行指令的存储器;
    所述处理器,用于执行所述计算机可执行指令,以实现上述权利要求1至17任一项所述的文本识别方法。
PCT/CN2022/107580 2021-07-23 2022-07-25 文本识别方法及装置、计算机可读存储介质和电子设备 WO2023001308A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/217,766 US20230351110A1 (en) 2021-07-23 2023-07-03 Text recognition method and apparatus, computer-readable storage medium and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110839399.2A CN113723096A (zh) 2021-07-23 2021-07-23 文本识别方法及装置、计算机可读存储介质和电子设备
CN202110839399.2 2021-07-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/217,766 Continuation US20230351110A1 (en) 2021-07-23 2023-07-03 Text recognition method and apparatus, computer-readable storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2023001308A1 true WO2023001308A1 (zh) 2023-01-26

Family

ID=78673900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107580 WO2023001308A1 (zh) 2021-07-23 2022-07-25 文本识别方法及装置、计算机可读存储介质和电子设备

Country Status (3)

Country Link
US (1) US20230351110A1 (zh)
CN (1) CN113723096A (zh)
WO (1) WO2023001308A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723096A (zh) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 文本识别方法及装置、计算机可读存储介质和电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293507A1 (en) * 2017-04-06 2018-10-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for extracting keywords based on artificial intelligence, device and readable medium
US20200395001A1 (en) * 2019-05-10 2020-12-17 Fmr Llc Building a knowledge base taxonomy from structured or unstructured computer text for use in automated user interactions
CN112632286A (zh) * 2020-09-21 2021-04-09 北京合享智慧科技有限公司 一种文本属性特征的识别、分类及结构分析方法及装置
CN112784603A (zh) * 2021-02-05 2021-05-11 北京信息科技大学 专利功效短语识别方法
CN113723096A (zh) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 文本识别方法及装置、计算机可读存储介质和电子设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536953B (zh) * 2015-01-22 2017-12-26 苏州大学 一种文本情绪极性的识别方法及装置
CN108628974B (zh) * 2018-04-25 2023-04-18 平安科技(深圳)有限公司 舆情信息分类方法、装置、计算机设备和存储介质
CN110929025B (zh) * 2018-09-17 2023-04-25 阿里巴巴集团控股有限公司 垃圾文本的识别方法、装置、计算设备及可读存储介质
CN111368535B (zh) * 2018-12-26 2024-01-16 珠海金山数字网络科技有限公司 一种敏感词识别方法、装置及设备
CN115455988A (zh) * 2018-12-29 2022-12-09 苏州七星天专利运营管理有限责任公司 一种高风险语句的处理方法和系统
CN112732912B (zh) * 2020-12-30 2024-04-09 平安科技(深圳)有限公司 敏感倾向表述检测方法、装置、设备及存储介质
CN113096667A (zh) * 2021-04-19 2021-07-09 上海云绅智能科技有限公司 一种错别字识别检测方法和系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293507A1 (en) * 2017-04-06 2018-10-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for extracting keywords based on artificial intelligence, device and readable medium
US20200395001A1 (en) * 2019-05-10 2020-12-17 Fmr Llc Building a knowledge base taxonomy from structured or unstructured computer text for use in automated user interactions
CN112632286A (zh) * 2020-09-21 2021-04-09 北京合享智慧科技有限公司 一种文本属性特征的识别、分类及结构分析方法及装置
CN112784603A (zh) * 2021-02-05 2021-05-11 北京信息科技大学 专利功效短语识别方法
CN113723096A (zh) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 文本识别方法及装置、计算机可读存储介质和电子设备

Also Published As

Publication number Publication date
CN113723096A (zh) 2021-11-30
US20230351110A1 (en) 2023-11-02

Similar Documents

Publication Publication Date Title
US11675977B2 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2023060795A1 (zh) 关键词自动提取方法、装置、设备及存储介质
WO2019153607A1 (zh) 智能应答方法、电子装置及存储介质
CN110444198B (zh) 检索方法、装置、计算机设备和存储介质
CN111324743A (zh) 文本关系抽取的方法、装置、计算机设备及存储介质
US20150170051A1 (en) Applying a Genetic Algorithm to Compositional Semantics Sentiment Analysis to Improve Performance and Accelerate Domain Adaptation
US20210224479A1 (en) Method for processing information, and storage medium
JP2020191075A (ja) Web APIおよび関連エンドポイントの推薦
CN113961685A (zh) 信息抽取方法及装置
WO2019201024A1 (zh) 用于更新模型参数的方法、装置、设备和存储介质
CN116303537A (zh) 数据查询方法及装置、电子设备、存储介质
WO2023001308A1 (zh) 文本识别方法及装置、计算机可读存储介质和电子设备
KR20240012245A (ko) 자연어처리 기반의 인공지능 모델을 이용한 faq를 자동생성하기 위한 방법 및 이를 위한 장치
US20210034621A1 (en) System and method for creating database query from user search query
CN114416926A (zh) 关键词匹配方法、装置、计算设备及计算机可读存储介质
US20220058214A1 (en) Document information extraction method, storage medium and terminal
US10614796B2 (en) Method of and system for processing a user-generated input command
CN112464927B (zh) 一种信息提取方法、装置及系统
US11574491B2 (en) Automated classification and interpretation of life science documents
CN113822059A (zh) 中文敏感文本识别方法、装置、存储介质及设备
US20230081015A1 (en) Method and apparatus for acquiring information, electronic device and storage medium
CN115858776B (zh) 一种变体文本分类识别方法、系统、存储介质和电子设备
JP2016519370A (ja) データ処理装置、データ処理方法及び電子機器
CN116486812A (zh) 基于语料关系的多领域唇语识别样本自动生成方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22845471

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE