CN111753066A - Method, device and equipment for expanding technical background text - Google Patents

Publication number
CN111753066A
Authority
CN
China
Prior art keywords
text
paragraph
texts
retrieval
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010420142.9A
Other languages
Chinese (zh)
Inventor
刘恺
张灏
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinju Intellectual Property Co ltd
Original Assignee
Beijing Xinju Intellectual Property Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • G06Q50/184 Intellectual property management

Abstract

The invention discloses a method, a device and equipment for expanding a technical background text. The method comprises: determining at least one paragraph text in the technical background text that may not be innovative; taking at least one retrieval sentence contained in that paragraph text as a retrieval object and retrieving in a pre-established retrieval database; determining similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval sentence; and adding the similar paragraph texts and/or similar sentences whose similarity values are higher than a similarity threshold into the technical background text as reference texts similar to the possibly non-innovative paragraph text. Compared with expanding the embodiment text by manual writing as in the prior art, the invention improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the relevant personnel, and thereby improves writing quality and efficiency.

Description

Method, device and equipment for expanding technical background text
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, and a device for expanding a technical background text.
Background
Patent documents are the largest technical information resource in the world; statistics indicate that they contain 90% to 95% of the world's scientific and technical information, and as intangible property they are drawing more and more attention. For example, 1.401 million patent applications were filed in China in 2019, and 453,000 patents were granted. By the end of December 2019 there were 2,649 patent agencies nationwide and more than 20,000 practicing patent agents, growth of 1.9 times and 1.5 times respectively compared with 2012. Even so, given the gap between the number of patent applications and the number of patent agents, many applicants still cannot efficiently submit the inventions generated during research and development to the relevant departments for patent application.
Although the technical scheme of an invention is clear to the ordinary applicant, and especially to the inventor, the rules and requirements of patent drafting are not well known to them, and it is difficult for them to draft a qualified application file independently. At present there is no feasible auxiliary means to help an applicant who understands the technical solution but lacks patent application experience to form a preliminary application document, nor any means to help inexperienced applicants and inventors quickly build a concept of patent drafting and grasp the basics.
The main symptoms in practice are therefore that patent applicants cannot write technical background texts, and no dedicated patent writing system provides them with convenient patent writing services. Paragraph texts that may not be innovative are an important source for generating the specific embodiments, so expanding such paragraph texts into embodiment content is particularly important. How to expand the technical background text so as to facilitate intelligent generation of patent application files is therefore a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the present invention has been made to provide a technical background text extension method, apparatus and device that overcome or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a method for extending a technical background text, which may include:
determining at least one paragraph text in the technical background text that may not be innovative;
taking at least one retrieval statement contained in at least one paragraph text which possibly does not have innovation as a retrieval object, retrieving in a pre-established retrieval database, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;
and comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences with the similarity values higher than the similarity threshold into the technical background texts as reference texts similar to the paragraph texts which possibly do not have innovativeness.
Optionally, before at least one retrieval statement contained in the paragraph text that may not be innovative is taken as a retrieval object and retrieved in the pre-established retrieval database, the method further includes:
vectorizing the at least one retrieval statement to obtain a retrieval statement vector;
in which case taking at least one retrieval statement contained in the paragraph text that may not be innovative as a retrieval object, retrieving in the pre-established retrieval database, and determining similarity values between the retrieved similar statements and/or similar paragraph texts and the retrieval statement includes:
taking the retrieval statement vector as the retrieval object, retrieving in the pre-established retrieval database to obtain similar statements and/or similar paragraph texts, and determining their similarity values with the retrieval statement from the calculated similarity distance between them and the retrieval statement.
Optionally, taking the search statement vector as a search object, searching in a pre-established search database, and determining a similarity value between the similar statement and/or the similar paragraph text and the search statement, includes:
taking the retrieval statement vector as the retrieval input, and determining the entry of the retrieval statement vector in the retrieval database according to a preset index mode;
calculating the similarity distances between the retrieval statement vector and all statement vectors in that entry and in its adjacent entries;
sorting the obtained similarity distances from small to large, and taking the similarity distances of a preset number of similar statements with the smallest similarity distances in the sorted result;
and converting the similarity distance between each similar statement and/or similar paragraph text and the retrieval statement into a similarity value.
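The optional retrieval flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the "preset index mode" is assumed to be a coarse bucket index keyed by the nearest centroid, "adjacent entries" are assumed to be the next-closest buckets, and the conversion from distance to similarity is assumed to be 1 / (1 + d). The names `search`, `centroids`, `buckets`, `n_probe` and `top_k` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, centroids, buckets, n_probe=2, top_k=3):
    """Return (sentence, similarity) pairs for the top_k nearest sentences."""
    # Assumed index mode: pick the bucket whose centroid is closest to the
    # query, plus the next-closest buckets as its "adjacent entries".
    order = sorted(range(len(centroids)),
                   key=lambda i: euclidean(query, centroids[i]))
    probed = order[:n_probe]
    # Similarity distance to every sentence vector in the probed buckets.
    scored = [(euclidean(query, vec), text)
              for i in probed
              for text, vec in buckets[i]]
    # Sort from small to large and keep a preset number of results.
    scored.sort(key=lambda pair: pair[0])
    # Assumed conversion: distance 0 maps to similarity 1.0.
    return [(text, 1.0 / (1.0 + d)) for d, text in scored[:top_k]]


# Toy data: three buckets of (sentence, vector) pairs.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
buckets = [
    [("sentence near origin", (0.1, 0.0))],
    [("sentence near one", (0.9, 1.0)), ("another near one", (1.0, 1.2))],
    [("far sentence", (5.0, 5.1))],
]
results = search((1.0, 1.0), centroids, buckets)
```

Probing only the nearest buckets trades a little recall for speed; the far bucket is never scanned in this toy run.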
Optionally, vectorizing at least one of the search statements to obtain a search statement vector, where the vectorizing includes:
performing word segmentation on the retrieval sentence according to a preset word segmentation method, and vectorizing each segmented word to obtain a word vector;
and computing a weighted sum of the word vectors, weighted by each word's term frequency in the technical background text and its inverse document frequency, to obtain the retrieval statement vector.
Optionally, adding similar paragraph texts and/or similar sentences with similarity values higher than the similarity threshold as reference texts similar to the paragraph texts which may not be innovative to the technical background text, including:
taking similar paragraph texts and/or similar sentences with similarity values higher than the similarity threshold as reference texts similar to the possibly non-innovative paragraph texts, associating the reference texts with the possibly non-innovative paragraph texts in any one or more of the following ways, and adding them into the technical background text:
as a comment, as a label, as a tracked revision, or as an annotation.
Optionally, the determining at least one paragraph text that may not be innovative in the technical background text includes:
judging whether any paragraph text in the technical background text carries a mark indicating that it is not innovative;
when such a mark is present, determining the paragraph texts carrying the mark as the paragraph texts that may not be innovative;
and when no such mark is present, performing semantic analysis on all paragraph texts in the technical background text, and determining the paragraph texts that may not be innovative according to the analysis result.
Optionally, the semantic analysis is performed on all the paragraph texts in the technical background text, and the paragraph texts which may not have innovativeness are determined through an analysis result, including:
comparing all paragraph texts in the technical background text with the background-art text contained in the technical background text, and if a paragraph text does not contain any technical effect sentence text from a preset language library, determining it as a paragraph text that may not be innovative; or
searching all paragraph texts in the technical background text for paragraph texts containing a technical effect statement text, and if a paragraph text does not contain any technical effect statement text from a preset language library, determining it as a paragraph text that may not be innovative; or
comparing the paragraph texts in the technical background text with the paragraph texts in a preset database, determining the similarity between them, and determining the paragraph texts whose similarity is higher than a preset threshold as the paragraph texts that may not be innovative.
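A minimal sketch of the determination logic above, covering the mark check and, as the fallback, only the second alternative (absence of a technical effect statement). The mark string and the contents of the "preset language library" are hypothetical placeholders; the source does not specify either.

```python
# Hypothetical mark that a drafter might attach to a paragraph.
NON_INNOVATIVE_MARK = "[non-innovative]"

# Hypothetical stand-in for the preset language library of
# technical effect phrases.
EFFECT_PHRASES = ("thereby improving", "so as to reduce", "has the advantage")


def possibly_non_innovative(paragraphs):
    """Return the paragraphs that may not be innovative."""
    # First check whether any paragraph carries the explicit mark.
    marked = [p for p in paragraphs if NON_INNOVATIVE_MARK in p]
    if marked:
        return marked
    # Otherwise fall back to a crude stand-in for semantic analysis:
    # a paragraph with no technical effect phrase is a candidate.
    return [p for p in paragraphs
            if not any(phrase in p.lower() for phrase in EFFECT_PHRASES)]


marked_doc = [
    "[non-innovative] A USB interface connects the storage device.",
    "The circuit shortens the loop, thereby improving ranging accuracy.",
]
unmarked_doc = [
    "A USB interface connects the storage device.",
    "The new circuit has the advantage of lower power consumption.",
]
from_marks = possibly_non_innovative(marked_doc)
from_analysis = possibly_non_innovative(unmarked_doc)
```

Real semantic analysis would need far more than substring matching; the sketch only shows the control flow of mark first, analysis second.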
In a second aspect, an embodiment of the present invention provides a device for expanding technical background text, which may include:
a determining module for determining at least one paragraph text in the technical background text that may not be innovative;
the retrieval module is used for retrieving in a pre-established retrieval database by taking at least one retrieval statement contained in at least one possibly non-innovative paragraph text as a retrieval object, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;
and the adding module is used for comparing the similarity value with a preset similarity threshold value, and adding the similar paragraph text and/or the similar sentence with the similarity value higher than the similarity threshold value into the technical background text as a reference text similar to the paragraph text which possibly does not have innovation.
In a third aspect, an embodiment of the present invention provides another method for expanding a technical background text, where the method may include:
determining at least one paragraph text in the technical background text that may not be innovative;
searching the keywords in the at least one possibly non-innovative paragraph text according to a preset searching mode to determine core keywords;
comparing the correlation value between the core keyword and each other keyword with a preset correlation threshold, and determining the other keywords whose correlation value is greater than the threshold as extension keywords; or sorting the other keywords in descending order of their similarity to the core keyword, and determining a preset number of the top-ranked other keywords as extension keywords;
constructing a target retrieval feature sequence of the paragraph text which possibly has no innovation according to the core keywords and the extension keywords;
taking the target retrieval characteristic sequence as a retrieval object, retrieving in a pre-established relational database, and calculating the similarity value of the target retrieval characteristic sequence and the retrieval characteristic sequence in the pre-established relational database;
and comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences with the similarity values higher than the similarity threshold into the technical background texts as reference texts similar to the paragraph texts which possibly do not have innovativeness.
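The two ways of choosing extension keywords in the third aspect can be sketched as below. The relevance scores, the threshold, and the function names `expand_keywords` and `build_feature_sequence` are assumptions for illustration; how the core keyword is found and how relevance is computed are not specified here, so both are taken as given inputs.

```python
def expand_keywords(relevance, threshold=None, top_n=None):
    """Pick extension keywords from {other_keyword: relevance_to_core}.

    With a threshold, keep keywords whose relevance exceeds it;
    otherwise rank in descending order and keep the first top_n.
    """
    if threshold is not None:
        return [k for k, r in relevance.items() if r > threshold]
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:top_n]


def build_feature_sequence(core, extensions):
    """Target retrieval feature sequence: core keyword first, then extensions."""
    return [core] + list(extensions)


# Hypothetical relevance values between a core keyword and other keywords.
core_keyword = "pseudo-random code"
relevance = {"m-sequence": 0.8, "ranging": 0.6, "server": 0.1}

by_threshold = expand_keywords(relevance, threshold=0.5)
by_rank = expand_keywords(relevance, top_n=1)
sequence = build_feature_sequence(core_keyword, by_threshold)
```

The threshold variant returns a variable number of keywords, while the ranking variant fixes the count; which is preferable depends on how evenly relevance scores are distributed.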
In a fourth aspect, an embodiment of the present invention provides another technical background text extension apparatus, which may include:
a first determining module, configured to determine at least one paragraph text in the technical background text that may not be innovative;
a second determining module, configured to search the keywords contained in the at least one possibly non-innovative paragraph text according to a preset searching mode to determine core keywords;
a third determining module, configured to compare the relevance value between the core keyword and each other keyword with a preset relevance threshold, and determine the other keywords whose relevance value is greater than the threshold as extension keywords;
a building module, configured to build a target retrieval feature sequence of the possibly non-innovative paragraph text according to the core keywords and the extension keywords;
the retrieval module is used for retrieving in a pre-established relational database by taking the target retrieval characteristic sequence as a retrieval object and calculating the similarity value of the target retrieval characteristic sequence and the retrieval characteristic sequence in the pre-established relational database;
and the adding module is used for comparing the similarity value with a preset similarity threshold value, and adding the similar paragraph text and/or the similar sentence with the similarity value higher than the similarity threshold value into the technical background text as a reference text similar to the paragraph text which possibly does not have innovation.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the above-mentioned technical background text extension method.
In a sixth aspect, an embodiment of the present invention provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program and is operable to implement the above-mentioned text extension method.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
In the embodiment of the invention, at least one paragraph text that may not be innovative is determined in the technical background text; at least one retrieval sentence contained in that paragraph text is then taken as a retrieval object and retrieved in a pre-established retrieval database; similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval sentence are determined; and the similar paragraph texts and/or similar sentences whose similarity values are higher than a similarity threshold are added into the technical background text as reference texts similar to the possibly non-innovative paragraph text. Compared with expanding the embodiment text by manual writing as in the prior art, the embodiment of the invention improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the relevant personnel, and thereby improves writing quality and efficiency.
Furthermore, because each retrieval sentence in the possibly non-innovative paragraph text is used as a retrieval object, the retrieval range is greatly expanded and the expanded reference text is richer: a large number of usable texts are provided to form similar embodiment texts, and the content of the technical background text is supported by a large number of reference texts that can be screened.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a method for expanding a technical background text according to embodiment 1 of the present invention;
FIG. 2 is a flowchart of vectorization processing performed on a search statement according to embodiment 1 of the present invention;
FIG. 3 is a flowchart of a specific implementation method of step S12;
FIG. 4 is a flowchart of a method for constructing a search database according to embodiment 1 of the present invention;
FIG. 5 is a flowchart of a specific implementation method of step S11;
FIG. 6 is a schematic structural diagram of a technical background text expansion device according to embodiment 1 of the present invention;
FIG. 7 is a flowchart illustrating another method for expanding a text according to another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another technical background text expansion apparatus according to embodiment 2 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem that description of an existing similar embodiment or an existing implementation mode is not complete enough when a patent is written manually in the prior art, the embodiment of the invention provides a method, a device and equipment for expanding a technical background text, which can expand the content of an embodiment based on the existing technical background text, complete intelligent generation of the text part of an embodiment of a specification, provide a reference text for relevant personnel, save a large amount of labor cost and improve the efficiency and quality of patent writing.
Example 1
Embodiment 1 of the present invention provides a method for expanding a technical background text which, referring to FIG. 1, includes the following steps:
Step S11: determine at least one paragraph text in the technical background text that may not be innovative.
Generally, not all paragraph texts in a technical background text are innovative: some describe or explain technical terms of the invention, some explain its implementation logic and implementation modes, and some highlight the differences from the prior art or the beneficial effects and advances of the described scheme.
It should be noted that a paragraph text that may not be innovative, in the sense of the embodiment of the present invention, is not necessarily non-innovative; it may merely have been judged non-innovative by the personnel involved in the scheme, or by a rough analysis. The present application analyzes such paragraph texts again in order to achieve intelligent expansion.
Step S12: take at least one search sentence contained in the at least one paragraph text that may not be innovative as a search object, search in a pre-established search database, and determine the similarity values between the retrieved similar sentences and/or similar paragraph texts and the search sentence.
Step S13: compare the similarity value with a preset similarity threshold, and add the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold into the technical background text as reference texts similar to the paragraph text that may not be innovative.
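Step S13 amounts to a threshold filter over the retrieval results. The sketch below assumes the retrieval of step S12 has already produced (text, similarity) pairs; the function name `add_reference_texts`, the threshold value 0.8 and the dictionary layout are hypothetical.

```python
def add_reference_texts(paragraph, candidates, threshold=0.8):
    """Attach to a paragraph the candidates whose similarity exceeds the threshold.

    candidates is a list of (text, similarity) pairs from step S12.
    """
    # "Higher than the threshold" is read as a strict comparison.
    references = [text for text, similarity in candidates
                  if similarity > threshold]
    return {"paragraph": paragraph, "references": references}


# Illustrative similarity values; not taken from the source.
candidates = [
    ("A correlator circuit for code ranging.", 0.91),
    ("A circuit using an m-sequence.", 0.80),
    ("A server for scheduled videoconferences.", 0.12),
]
expanded = add_reference_texts(
    "A correlation operation circuit in pseudo-random code ranging.",
    candidates,
)
```

A candidate sitting exactly on the threshold is dropped under this reading; an implementation that wants it kept would use `>=` instead.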
In an alternative embodiment, before step S12 is performed, at least one of the search sentences is vectorized to obtain a search sentence vector. Correspondingly, the search sentence vector is used as the search object for searching in the pre-established search database, and the similarity values between the retrieved similar sentences and/or similar paragraph texts and the search sentence are determined.
By vectorizing the search sentences, the embodiment of the invention can search the pre-established search database quickly to obtain the similar sentences and/or similar paragraph texts, which improves both search efficiency and search accuracy.
In an alternative embodiment, the retrieval statement vectorization processing is implemented as follows. As an example, the embodiment of the present invention uses the following statements:
Statement A: "A correlation operation circuit in pseudo random code ranging."
Statement B: "Its typical application is to use an m-sequence as a pseudo-random code."
Statement C: "Connecting device for USB interface and storage device of storage device."
Statement D: "Wherein the interface connector is designed as the above-mentioned connecting device."
Statement E: "Invoke server for scheduled videoconference."
The detailed process of vectorizing the search statement in the embodiment of the present invention is as follows, and as shown in fig. 2, the detailed process may include:
Step S21: perform word segmentation on the search sentence according to a preset word segmentation method, and vectorize each segmented word to obtain a word vector.
Step S22: compute a weighted sum of the word vectors, weighted by each word's term frequency in the technical background text and its inverse document frequency, to obtain the retrieval statement vector.
The preset word segmentation method may be any existing method, such as a string-matching segmentation algorithm or a statistics-based machine learning algorithm. After the segmented words are obtained, each is vectorized to obtain its word vector, for example by word-to-vector processing. The embodiment of the invention uses a FastText word vector model, which takes the words in all technical background texts as training input and outputs a word vector for each segmented word.
Taking statement A above as an example, its segmented words are: "a", "at", "pseudo", "random code", "ranging", "in", "correlated", "operating", "circuit". The word vector of "circuit" may be expressed as [-0.0529, -0.2667, ……, -0.0355, 0.0803]. In this embodiment the vector dimension can be set according to actual requirements, for example 256 dimensions.
After the word vectors are obtained, the sentence vector is obtained as their weighted sum, using each word's term frequency in the technical background text and its inverse document frequency as weights. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining to evaluate how important a term is to one document within a document collection or corpus (here, the technical background texts).
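The weights combine the term frequency (TF) of a segmented word in the technical background text with its inverse document frequency (IDF). The formula images of the original publication (BDA0002496648110000101 to BDA0002496648110000103) are not recoverable from this text; assuming the standard definitions, they read
\[ \mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_{i} = \log\frac{|D|}{|\{\, j : t_i \in d_j \,\}|}, \qquad \text{TF-IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i} \]
where n_{i,j} is the number of occurrences of segmented word t_i in technical background text d_j, the denominator of TF is the total number of segmented words in d_j, |D| is the number of texts in the corpus, and the denominator of IDF is the number of texts that contain t_i.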
Statement A is again used as an example. The word vector of "a" is denoted V_a and its term frequency-inverse document frequency TF-IDF_a; the word vector of "at" is denoted V_at and its weight TF-IDF_at; and so on up to "circuit", whose word vector is denoted V_circuit and whose weight is TF-IDF_circuit. The sentence vector of statement A is then
V_a × TF-IDF_a + V_at × TF-IDF_at + …… + V_circuit × TF-IDF_circuit.
The sentence vector in the embodiment of the present invention may have the same dimension as the word vectors, for example 256 dimensions.
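The weighted sum can be sketched as follows. This is an illustration under assumptions: the TF and IDF use the plain unsmoothed textbook definitions, TF is counted over the whole technical background text rather than the single sentence, the two-dimensional word vectors are toy values standing in for trained FastText vectors, and the function name `sentence_vector` is hypothetical.

```python
import math
from collections import Counter


def sentence_vector(sentence_tokens, doc_tokens, corpus, word_vectors):
    """Sum of word vectors weighted by TF-IDF.

    doc_tokens: all segmented words of the technical background text.
    corpus: list of token lists, one per text, used for document frequency.
    word_vectors: {word: vector}; words without a vector are skipped.
    """
    counts = Counter(doc_tokens)
    dim = len(next(iter(word_vectors.values())))
    result = [0.0] * dim
    # Each distinct segmented word contributes once, as in V_a * TF-IDF_a + ...
    for word in dict.fromkeys(sentence_tokens):
        if word not in word_vectors:
            continue
        tf = counts[word] / len(doc_tokens)
        df = sum(1 for text in corpus if word in text)
        # Unsmoothed IDF; a word found in no text gets weight 0.
        idf = math.log(len(corpus) / df) if df else 0.0
        weight = tf * idf
        for i, component in enumerate(word_vectors[word]):
            result[i] += weight * component
    return result


# Toy data: 2-dimensional vectors instead of, say, 256 dimensions.
word_vectors = {"correlation": [1.0, 0.0], "circuit": [0.0, 1.0]}
doc_tokens = ["correlation", "circuit", "circuit", "ranging"]
corpus = [doc_tokens, ["circuit", "usb"], ["server", "video"], ["usb", "storage"]]
vec = sentence_vector(["correlation", "circuit"], doc_tokens, corpus, word_vectors)
```

With unsmoothed IDF, a word that appears in every text gets weight zero and drops out of the sentence vector, which is the usual way TF-IDF suppresses uninformative words.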
In an alternative embodiment, retrieving in the pre-established retrieval database with the retrieval sentence vector as the retrieval object, and determining the similarity value between the similar sentence and/or similar paragraph text and the retrieval sentence, may include the following steps, as shown in fig. 3:
Step S31, taking the retrieval sentence vector as the input of the retrieval object, and determining the entry of the retrieval sentence vector in the retrieval database according to a preset indexing mode.
The retrieval database in the embodiment of the present invention includes a database and a vector index library, both built from a preset corpus; the specific construction method is described below. The retrieval database may be a database built on a relational database management system such as MySQL; of course, other types of databases may also be used.
The database includes a plurality of entries, each entry including: the original sentence text, the sentence vector corresponding to the original sentence text, and the full-text number of the original sentence text. The original sentence text is stored so that it can be conveniently extracted for reference or use; the sentence vector is used to calculate the similarity distance between the sentence and the central sentence; the full-text number of the original sentence text is used for ordering, indexing, and similar operations on all sentences in the database.
Step S32, calculating the similarity distance between the retrieval sentence vector and every sentence vector in the entry and in the adjacent entries.
In the embodiment of the invention, the similarity distance between the retrieval sentence vector and the sentence vectors in the entry and the adjacent entries is calculated with a conventional distance measure, for example the Euclidean distance; the embodiment of the present invention does not specifically limit this. The smaller the similarity distance, the higher the similarity between the original sentence text in the database and the retrieval sentence.
Step S33, sorting the obtained similarity distances in ascending order, and acquiring the similarity distances of the preset number of similar sentences with the smallest similarity distances in the sorted result.
Step S34, converting the similarity distance between the similar sentence and/or similar paragraph text and the retrieval sentence into a similarity value.
According to the embodiment of the invention, the similarity values between the similar sentences and/or similar paragraphs and the retrieval sentence are determined by the above retrieval and are then compared with the preset similarity threshold. For example, with the similarity threshold set to 65% and the preset number set to 5, if the similarity values between the 5 similar sentences and the retrieval sentence are all greater than 65%, these sentences are determined to be reference texts of the paragraph text.
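Steps S32 to S34 and the threshold comparison can be sketched as follows. The sketch assumes the candidate entries have already been located (step S31) and are passed in as a list; the function names and the distance-to-similarity conversion 1 / (1 + d) are assumptions, since the text does not specify how a similarity distance is converted into a similarity value.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def to_similarity(distance):
    """Assumed conversion: distance 0 maps to 1.0 and larger distances decay toward 0."""
    return 1.0 / (1.0 + distance)


def retrieve_references(query_vector, candidates, top_k=5, threshold=0.65):
    """Rank candidates by distance and keep those above the similarity threshold.

    `candidates` is a list of (full_text_number, sentence_text, sentence_vector).
    Returns a list of (full_text_number, sentence_text, similarity).
    """
    # Step S32: distance from the retrieval sentence vector to every candidate.
    scored = [(euclidean(query_vector, vec), num, text) for num, text, vec in candidates]
    # Step S33: ascending order, keep the preset number of closest sentences.
    scored.sort(key=lambda item: item[0])
    closest = scored[:top_k]
    # Step S34: convert each distance into a similarity value.
    results = [(num, text, to_similarity(d)) for d, num, text in closest]
    # Threshold comparison: keep only sufficiently similar sentences.
    return [r for r in results if r[2] > threshold]


# Hypothetical toy entries with 2-dimensional sentence vectors.
candidates = [
    (1, "sentence one", [0.0, 0.0]),
    (2, "sentence two", [0.3, 0.4]),
    (3, "sentence three", [3.0, 4.0]),
]
references = retrieve_references([0.0, 0.0], candidates)
```

In this sketch each retrieved sentence is compared with the threshold individually; requiring that all of the preset number of sentences exceed the threshold, as in the example above, would be a one-line change.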
In an alternative embodiment, adding the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text, as reference texts similar to the paragraph text that may not be innovative, includes:
taking the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold as reference texts similar to the paragraph text that may not be innovative, associating the reference texts with the paragraph text that may not be innovative in any one or more of the following ways, and adding them to the technical background text:
comment mode, labeling mode, revision mode, and annotation mode.
According to the embodiment of the invention, similar sentences and/or similar paragraphs are added intelligently, which provides a reference for the relevant personnel so that paragraph texts that may not be innovative can be expanded, improving both work quality and work efficiency.
In a specific embodiment, the retrieval database may be constructed in advance by building a database and a vector index library for a corpus drawn from a large number of existing disclosures, such as patent documents, papers, and periodicals. The specific construction method, as shown in fig. 4, may include the following steps:
Step S41, performing word segmentation on the sentences in the preset corpus with a preset word segmentation method, and vectorizing the segmented words to obtain all word segmentation vectors.
Step S42, performing a weighted summation of the word segmentation vectors, using the term frequency of each segmented word in the preset corpus and its inverse document frequency, to obtain the sentence vectors.
In the embodiment of the present invention, for the specific implementation of step S41 and step S42, reference may be made to the examples and descriptions relating to step S21 and step S22, which are not repeated here.
Step S43, storing the sentence vector, the original sentence text, and the full-text number corresponding to the original sentence text in a relational database.
For example, the relational database that is maintained may be as shown in table 1 below:
TABLE 1
Figure BDA0002496648110000131
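Step S43 can be sketched with an in-memory SQLite database standing in for the relational database (the text names MySQL as one option). The table name, column names, and the choice to serialize the sentence vector as JSON text are assumptions made for this illustration; they are not taken from table 1.

```python
import json
import sqlite3

# In-memory database as a stand-in for the relational database of step S43.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sentences (
        full_text_number INTEGER PRIMARY KEY,  -- full-text number of the sentence
        sentence_text    TEXT NOT NULL,        -- original sentence text
        sentence_vector  TEXT NOT NULL         -- sentence vector, JSON-encoded
    )
    """
)


def store_sentence(connection, number, text, vector):
    """Insert one entry: full-text number, original text, and vector."""
    connection.execute(
        "INSERT INTO sentences VALUES (?, ?, ?)",
        (number, text, json.dumps(vector)),
    )


def load_sentence(connection, number):
    """Return (text, vector) for a full-text number, or None if absent."""
    row = connection.execute(
        "SELECT sentence_text, sentence_vector FROM sentences "
        "WHERE full_text_number = ?",
        (number,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])


store_sentence(conn, 1, "a correlation circuit for pseudo-random code ranging", [0.1, 0.2])
entry = load_sentence(conn, 1)
```

Keeping the full-text number as the primary key lets a vector index return only numbers, with the original text looked up afterwards.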
Step S44, constructing a vector index library of the sentences with a preset similar-text retrieval algorithm.
The embodiment of the invention uses an approximate nearest neighbor similar-text retrieval algorithm to construct the index library of the database. Examples include HNSW (Hierarchical NSW), a newer approximate k-nearest-neighbor search method that improves on the NSW method and is called the Hierarchical NSW method because it is composed of multiple layers of neighbor graphs, and Faiss (a clustering and similarity search library open-sourced by the Facebook AI team). Faiss, which this embodiment adopts, is a framework that provides efficient similarity search for dense vectors; it supports search over hundreds of millions of vectors, retrieves quickly, and is one of the most mature approximate nearest neighbor search libraries at present. The input of the algorithm is the matrix of sentence vectors and the full-text numbers of the sentences in the database: if the database holds 100,000 sentences and the vector dimension is 256, the input is a 100,000 × 256 vector matrix together with the full-text numbers of the corresponding sentences, and the retrieval index is obtained with a Faiss retrieval method. Faiss offers a variety of retrieval methods. The IndexIVFFlat method, for example, defines a number of Voronoi cells in the d-dimensional (here 256-dimensional) space so that each sentence vector in the database falls into one of the cells; IndexIVFFlat requires a training process, after which the Faiss retrieval index IndexIVFFlat is obtained. The Faiss 'PCA64, IVF1000, Flat' indexing method currently used, combined with the FastText vectors of the sentences, gives the following results when the original sentences themselves are used as queries in a retrieval test: Recall@Top1 (the expected result is returned in first place) is 99.7893%, Recall@Top2 is 99.8883%, and Recall@Top3 is 99.9863%.
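The idea behind a cell-based (inverted-file) index can be sketched in plain Python as follows. This is not the Faiss API and not the index used in the embodiment: the class and method names are hypothetical, the cell centroids are supplied directly instead of being learned in a training step, and no PCA reduction is applied. It only illustrates how each vector falls into one cell and how a query inspects its own cell and, optionally, adjacent cells.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


class InvertedFileIndex:
    """Toy inverted-file index: one list of vectors per Voronoi cell."""

    def __init__(self, centroids):
        # Centroids are given here; a real index would learn them by training.
        self.centroids = centroids
        self.cells = [[] for _ in centroids]

    def _nearest_cells(self, vector, count):
        """Indices of the `count` cells whose centroids are closest to `vector`."""
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: distance(vector, self.centroids[i]),
        )
        return order[:count]

    def add(self, full_text_number, vector):
        """Store a sentence vector in the cell of its nearest centroid."""
        cell = self._nearest_cells(vector, 1)[0]
        self.cells[cell].append((full_text_number, vector))

    def search(self, query, k=1, nprobe=1):
        """Scan the `nprobe` closest cells and return the k nearest entries."""
        candidates = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.cells[cell])
        ranked = sorted(candidates, key=lambda item: distance(query, item[1]))
        return [(number, distance(query, vec)) for number, vec in ranked[:k]]


# Hypothetical 2-dimensional example with three cells.
index = InvertedFileIndex(centroids=[[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
index.add(1, [0.5, 0.5])
index.add(2, [9.0, 9.0])
index.add(3, [21.0, 20.0])
index.add(4, [1.0, 0.0])

same_cell_hits = index.search([0.9, 0.1], k=2, nprobe=1)
adjacent_cell_hits = index.search([6.0, 6.0], k=3, nprobe=2)
```

Raising `nprobe` trades speed for recall: more cells are scanned, so fewer true neighbors near a cell boundary are missed.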
In an alternative embodiment, the implementation of step S11 can be as shown in fig. 5, and may include the following steps:
Step S51, judging whether any paragraph text in the technical background text is marked with an identifier indicating a lack of innovation; if such an identifier is present, executing step S52; otherwise, executing step S53.
Step S52, determining the paragraph text marked with the non-innovative identifier as the paragraph text that may not be innovative.
Step S53, when no such identifier is present, performing semantic analysis on all paragraph texts in the technical background text, and determining the paragraph texts that may not be innovative according to the analysis result.
Specifically, the implementation of step S53 may include the following steps:
comparing all paragraph texts in the technical background text with the background-art text in the technical background text, and if a paragraph text does not contain any technical-effect sentence text from the preset language library, determining that paragraph text as a paragraph text that may not be innovative; or
searching all paragraph texts in the technical background text for paragraph texts containing technical-effect sentence text, and if a paragraph text does not contain any technical-effect sentence text from the preset language library, determining that paragraph text as a paragraph text that may not be innovative; or
comparing the paragraph texts in the technical background text with the paragraph texts in a preset database, determining the similarity between the paragraph texts in the technical background text and the paragraph texts in the database, and determining the paragraph texts whose similarity is higher than a preset threshold as paragraph texts that may not be innovative.
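Steps S51 to S53 can be sketched as follows, using the second of the listed analysis options (absence of any technical-effect sentence text). The marker string, the contents of the phrase library, and the function name are hypothetical; a real implementation would use the identifier and the preset language library of the embodiment, and simple substring matching stands in for semantic analysis.

```python
# Hypothetical identifier that a drafter places on a non-innovative paragraph.
NON_INNOVATIVE_MARK = "[non-innovative]"

# Hypothetical stand-in for the preset library of technical-effect phrases.
EFFECT_PHRASES = ("thereby improving", "so as to reduce", "which increases")


def find_candidate_paragraphs(paragraphs):
    """Return the paragraphs that may not be innovative."""
    # Steps S51/S52: explicitly marked paragraphs take priority.
    marked = [p for p in paragraphs if NON_INNOVATIVE_MARK in p]
    if marked:
        return marked
    # Step S53: no marker found, so fall back to a text check and keep the
    # paragraphs that contain no technical-effect phrase from the library.
    return [
        p for p in paragraphs
        if not any(phrase in p.lower() for phrase in EFFECT_PHRASES)
    ]


marked_input = [
    "[non-innovative] A power supply is connected to the circuit.",
    "The timer cuts power, thereby improving energy efficiency.",
]
unmarked_input = [
    "A power supply is connected to the circuit.",
    "The timer cuts power, thereby improving energy efficiency.",
]
from_marks = find_candidate_paragraphs(marked_input)
from_analysis = find_candidate_paragraphs(unmarked_input)
```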
It should be noted that, in the embodiment of the present invention, both the paragraph texts marked with the non-innovative identifier and the paragraph texts determined to be possibly non-innovative by semantic analysis of all paragraph texts in the technical background text may be treated as paragraphs of the technical background text that may not be innovative and be retrieved for expansion; the embodiment of the present invention does not specifically limit this.
In this embodiment, the paragraph texts that may not be innovative are first screened and analyzed, and retrieval is then performed on the basis of the screening result, which makes the retrieval more accurate, improves retrieval efficiency, and further expands the content of the embodiment text.
Based on the same inventive concept, an embodiment of the present invention further provides a technical background text extension apparatus. As shown in fig. 6, the apparatus may include a determining module 61, a retrieval module 62, and an adding module 63, which work as follows:
The determining module 61 determines at least one paragraph text in the technical background text that may not be innovative.
The retrieval module 62 takes at least one retrieval sentence contained in the at least one paragraph text that may not be innovative as a retrieval object, performs retrieval in a pre-established retrieval database, and determines the similarity value between each retrieved similar sentence and/or similar paragraph text and the retrieval sentence.
The adding module 63 compares the similarity value with a preset similarity threshold, and adds the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text as reference texts similar to the paragraph text that may not be innovative.
In an optional embodiment, the retrieval module 62 is further configured to perform vectorization processing on at least one of the retrieval statements to obtain a retrieval statement vector; then, the retrieval module 62 retrieves from a pre-established retrieval database by using the retrieval statement vector as a retrieval object to obtain a similar statement and/or a similar paragraph text, and determines a similarity value between the similar statement and/or the similar paragraph text and the retrieval statement according to the calculated similarity distance between the similar statement and/or the similar paragraph text and the retrieval statement.
In an optional embodiment, the retrieving process specifically includes: the retrieval module 62 takes the retrieval statement vector as an input of a retrieval object, and determines an entry of the retrieval statement vector in the retrieval database according to a preset indexing mode; the retrieval module 62 calculates similarity distances between all statement vectors in the entry and the adjacent entry and the retrieval statement vector; the retrieval module 62 sorts the obtained similarity distances from small to large, and obtains similarity distances corresponding to a preset number of similar sentences with small similarity distances in the sorting result; the retrieval module 62 converts the similarity distance between the similar sentences and/or the similar paragraph texts and the retrieval sentence into a similarity value.
In a more specific embodiment, the vectorization of at least one retrieval sentence by the retrieval module 62 to obtain a retrieval sentence vector may include: the retrieval module 62 performs word segmentation on the retrieval sentence according to a preset word segmentation method and vectorizes the segmented words to obtain word segmentation vectors; the retrieval module 62 then performs a weighted summation of the word segmentation vectors, using the term frequency of each segmented word in the technical background text and its inverse document frequency, to obtain the retrieval sentence vector.
In an alternative embodiment, the adding module 63 takes the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold as reference texts similar to the paragraph text that may not be innovative, associates them with the paragraph text that may not be innovative in any one or more of the following ways, and adds them to the technical background text: comment mode, labeling mode, revision mode, and annotation mode.
In an alternative embodiment, the determining module 61 determining at least one paragraph text in the technical background text that may not be innovative includes:
the determining module 61 judges whether any paragraph text in the technical background text is marked with an identifier indicating a lack of innovation;
when such an identifier is present, the determining module 61 determines the paragraph texts marked with the non-innovative identifier as the paragraph texts that may not be innovative;
when no such identifier is present, the determining module 61 performs semantic analysis on all paragraph texts in the technical background text, and determines the paragraph texts that may not be innovative according to the analysis result.
More specifically, the determining module 61 determines, through the analysis result, a paragraph text that may not be innovative, including the following ways:
the determining module 61 compares all paragraph texts in the technical background text with the background-art text in the technical background text, and if a paragraph text does not contain any technical-effect sentence text from the preset language library, determines that paragraph text as a paragraph text that may not be innovative; or
the determining module 61 searches all paragraph texts in the technical background text for paragraph texts containing technical-effect sentence text, and if a paragraph text does not contain any technical-effect sentence text from the preset language library, determines that paragraph text as a paragraph text that may not be innovative; or
the determining module 61 compares the paragraph texts in the technical background text with the paragraph texts in a preset database, determines the similarity between the paragraph texts in the technical background text and the paragraph texts in the database, and determines the paragraph texts whose similarity is higher than a preset threshold as paragraph texts that may not be innovative.
For specific description, beneficial effects and relevant examples of the device according to the embodiment of the present disclosure, reference is made to the above method portion, and details are not repeated herein.
Based on the same inventive concept, the embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the above technical background text extension method can be implemented.
Based on the same inventive concept, the embodiment of the present invention further provides a server, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the server can be used to implement the above-mentioned technical background text extension method.
For specific descriptions, beneficial effects and related examples of the computer-readable storage medium and the server according to the embodiments of the present disclosure, reference is made to the above method portions, and details are not repeated herein.
Embodiment 2
Embodiment 2 of the present invention provides another technical background text extension method, and as shown in fig. 7, the method may include the following steps:
Step S71, determining at least one paragraph text in the technical background text that may not be innovative.
For detailed description and specific implementation of this step, reference is made to the content of step S11 in embodiment 1, and details are not repeated here.
Step S72, searching the keywords contained in the at least one paragraph text that may not be innovative according to a preset search mode to determine the core keywords.
In this step, keywords are first extracted from the paragraph text that may not be innovative: the formatting in the paragraph text is removed to obtain a plurality of keywords, for example by extracting the keywords with a bag-of-words (BoW) model. The extracted keywords are then searched according to a preset search mode to determine the core keywords, for example by selecting the keywords with the highest occurrence frequency, or the keywords whose occurrence frequency exceeds a preset proportion (for example, 15%) of the occurrence frequency of all keywords. In one specific example, suppose the determined core keywords are "power supply", "energy saving", and "timing". It should be noted that the paragraph text above contains not just one keyword but a plurality of keywords.
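The frequency-proportion rule for core keywords can be sketched as follows. The tokenization by regular expression, the small stop-word set, and the function names are assumptions for illustration; in particular, splitting on letters is only a stand-in for the word segmentation that a Chinese-language text would require.

```python
import re
from collections import Counter

# Hypothetical stop-word list used to drop non-keywords.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def extract_keywords(paragraph):
    """Bag-of-words style extraction: lowercase words minus stop words."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    return [w for w in words if w not in STOP_WORDS]


def core_keywords(keywords, min_share=0.15):
    """Keywords whose share of all keyword occurrences exceeds `min_share`."""
    counts = Counter(keywords)
    total = len(keywords)
    if total == 0:
        return []
    return [word for word, count in counts.most_common() if count / total > min_share]


paragraph = (
    "The power supply saves power. "
    "The timing circuit controls the power supply timing."
)
keywords = extract_keywords(paragraph)
core = core_keywords(keywords)
```

Here "power" accounts for 3 of 10 keyword occurrences and "supply" and "timing" for 2 of 10 each, so all three exceed the 15% share, while the words that occur once do not.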
Step S73, comparing the correlation value between the core keyword and each other keyword with a preset correlation threshold, and determining the other keywords whose correlation value is greater than the preset correlation threshold as extension keywords; or arranging the other keywords in descending order of their correlation with the core keyword, and determining a preset number of top-ranked other keywords as extension keywords.
In this step, vectorization is first performed on all extracted keywords (including, of course, the core keywords), for example based on word embedding (a vectorized representation of word meaning); for instance, "power supply" is represented by the multidimensional vector (0.2, 0.5, 0.3, 0.4, 0.3, …, 0.1).
Secondly, the correlation between each core keyword and all other keywords is calculated; in the embodiment of the invention, the cosine of the angle between the multidimensional vectors is used as the correlation between each core keyword and the other keywords.
Then, the obtained correlation value is compared with the preset correlation threshold, and the other keywords whose correlation value is greater than the preset correlation threshold are taken as extension keywords; or the other keywords are arranged in descending order of their correlation with the core keyword, and a preset number of top-ranked other keywords are taken as extension keywords. For example, a word count N is set, and the top N other keywords in the ranking are determined as extension keywords. As another example, with the correlation threshold set to 0.6, the result is as shown in table 2:
TABLE 2
Figure BDA0002496648110000181
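The two selection rules of step S73 (correlation threshold, or top N by correlation) can be sketched as follows. The two-dimensional keyword vectors and the function names are hypothetical and chosen only so that the cosine values are easy to check; they are not the values of table 2.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extension_keywords(core, others, vectors, threshold=None, top_n=None):
    """Rank `others` by correlation with `core`, then filter and/or truncate.

    Returns (keyword, correlation) pairs in descending order of correlation.
    """
    scored = sorted(
        ((cosine(vectors[core], vectors[word]), word) for word in others),
        reverse=True,
    )
    if threshold is not None:
        scored = [(score, word) for score, word in scored if score > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [(word, score) for score, word in scored]


# Hypothetical word-embedding vectors.
vectors = {
    "timing": [1.0, 0.0],
    "on time": [0.8, 0.6],
    "time-keeping": [0.96, 0.28],
    "transformer": [0.0, 1.0],
}
others = ["on time", "time-keeping", "transformer"]

by_threshold = extension_keywords("timing", others, vectors, threshold=0.6)
by_top_n = extension_keywords("timing", others, vectors, top_n=1)
```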
Step S74, constructing a target retrieval feature sequence of the paragraph text that may not be innovative according to the core keywords and the extension keywords.
The extended keywords obtained in step S73 and the core keywords together form a high-dimensional keyword set, and all combinations of the high-dimensional keyword set form a target retrieval feature sequence of the paragraph text.
For example, the target search feature sequence formed by combining all the high-dimensional key phrases in table 2 above may be:
power supply, energy saving, timing (1, 1, 1);
power supply, energy saving, on time (1, 1, 0.7);
power supply, energy saving, time-keeping (1, 1, 0.6);
power supply, power saving, timing (1, 0.9, 1);
……
transformer, small energy consumption, time-keeping (0.6, 0.6, 0.6).
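The combinations listed above can be generated as a Cartesian product over one keyword group per core keyword. The sketch below is an illustration only: the group contents are inferred from the combinations shown above and are not copied from table 2, which may contain further keywords, and the function name is hypothetical.

```python
from itertools import product


def target_feature_sequences(keyword_groups):
    """Build every combination that takes one keyword from each group.

    Each group lists (keyword, correlation) pairs, with the core keyword
    first at correlation 1.0, followed by its extension keywords.
    Returns a list of (keywords, correlations) tuples.
    """
    sequences = []
    for combination in product(*keyword_groups):
        words = tuple(word for word, _ in combination)
        weights = tuple(weight for _, weight in combination)
        sequences.append((words, weights))
    return sequences


# Illustrative groups (assumed; see the note above).
keyword_groups = [
    [("power supply", 1.0), ("transformer", 0.6)],
    [("energy saving", 1.0), ("power saving", 0.9), ("small energy consumption", 0.6)],
    [("timing", 1.0), ("on time", 0.7), ("time-keeping", 0.6)],
]
sequences = target_feature_sequences(keyword_groups)
```

With these groups the product yields 2 × 3 × 3 = 18 sequences, starting with the all-core-keyword combination and ending with the combination of the lowest-correlation extension keywords.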
Step S75, taking the target retrieval feature sequence as a retrieval object, performing retrieval in a pre-established relational database, and calculating the similarity value between the target retrieval feature sequence and the retrieval feature sequences in the pre-established relational database.
It is understood that the target search feature sequence obtained in step S74 is not limited to one, and the similarity value between the target search feature sequence and the search feature sequence in the previously established relational database is calculated using all the target search feature sequences as search targets. Of course, when performing the calculation, iterative training may also be performed on the core keywords, the expanded keywords, and the correlation thereof, so as to find the closest similar text.
The relational database pre-established in the embodiment of the present invention, like the retrieval database in embodiment 1, may store, according to a structure preset in the database, the original paragraph text (or sentence text), the full-text number of the original paragraph text, the high-dimensional keyword set of the paragraph text, and the retrieval feature sequence corresponding to the high-dimensional keyword set.
For the calculation of the similarity in this step, reference may be made to the detailed description of steps S31 to S34 in embodiment 1, which is not repeated here.
Step S76, comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text as reference texts similar to the paragraph text that may not be innovative.
The specific implementation manner of this step refers to the specific description in step S13 in embodiment 1, and this embodiment is not described herein again.
In this embodiment of the invention, the core keywords and the extension keywords are determined from the keywords in at least one paragraph text of the technical background text that may not be innovative, and a target retrieval feature sequence of that paragraph text is constructed from them. Retrieval and analysis are then performed to determine the similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval sentence, and the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold are added to the technical background text as reference texts similar to the paragraph text that may not be innovative. Compared with the prior-art practice of expanding the embodiment text by manual writing, this improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the relevant personnel, and thereby improves writing quality and efficiency.
Based on the same inventive concept, an embodiment of the present invention further provides another technical background text extension apparatus. As shown in fig. 8, the apparatus may include a first determining module 81, a second determining module 82, a third determining module 83, a constructing module 84, a retrieval module 85, and an adding module 86, which work as follows:
the first determining module 81 determines at least one paragraph text in the technical background text that may not be innovative;
the second determining module 82 searches the keywords contained in the at least one paragraph text that may not be innovative according to a preset search mode to determine the core keywords;
the third determining module 83 compares the correlation value between the core keyword and each other keyword with a preset correlation threshold, and determines the other keywords whose correlation value is greater than the preset correlation threshold as extension keywords;
the constructing module 84 constructs a target retrieval feature sequence of the paragraph text that may not be innovative according to the core keywords and the extension keywords;
the retrieval module 85 takes the target retrieval feature sequence as a retrieval object, performs retrieval in a pre-established relational database, and calculates the similarity value between the target retrieval feature sequence and the retrieval feature sequences in the pre-established relational database;
the adding module 86 compares the similarity value with a preset similarity threshold, and adds the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text as reference texts similar to the paragraph text that may not be innovative.
For specific description, beneficial effects and relevant examples of the device according to the embodiment of the present disclosure, reference is made to the above method portion, and details are not repeated herein.
Based on the same inventive concept, the embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the above technical background text extension method can be implemented.
Based on the same inventive concept, the embodiment of the present invention further provides a server, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the server can be used to implement the above-mentioned technical background text extension method.
For specific descriptions, beneficial effects and related examples of the computer-readable storage medium and the server according to the embodiments of the present disclosure, reference is made to the above method portions, and details are not repeated herein.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (11)

1. A method for expanding a technical background text, comprising:
determining at least one paragraph text in the technical background text that may not be innovative;
taking at least one retrieval statement contained in at least one paragraph text which possibly does not have innovation as a retrieval object, retrieving in a pre-established retrieval database, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;
and comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences with the similarity values higher than the similarity threshold into the technical background texts as reference texts similar to the paragraph texts which possibly do not have innovativeness.
2. The method of claim 1, wherein:
before taking at least one retrieval statement contained in the paragraph text which may not be innovative as a retrieval object and performing retrieval in a pre-established retrieval database, the method further comprises:
vectorizing at least one retrieval statement to obtain a retrieval statement vector;
taking at least one retrieval statement contained in the paragraph text which may not have innovativeness as a retrieval object, retrieving in a pre-established retrieval database, and determining similarity values of the retrieved similar statements and/or similar paragraph text and the retrieval statement, including:
and taking the retrieval statement vector as a retrieval object, retrieving in a pre-established retrieval database to obtain similar statements and/or similar paragraph texts, and determining similarity values of the similar statements and/or similar paragraph texts and the retrieval statement according to the calculated similarity distance between the similar statements and/or similar paragraph texts and the retrieval statement.
3. The method according to claim 2, wherein the searching in a pre-established search database with the search statement vector as a search object to determine the similarity value between the similar statement and/or similar paragraph text and the search statement comprises:
taking the retrieval statement vector as the input retrieval object, and determining the entry of the retrieval statement vector in the retrieval database according to a preset index mode;
calculating similarity distances between the retrieval statement vector and all statement vectors in the entry and its adjacent entries;
sorting the obtained similarity distances in ascending order, and acquiring the similarity distances corresponding to a preset number of similar sentences with the smallest similarity distances in the sorted result;
and converting the similarity distance between the similar sentence and/or the similar paragraph text and the retrieval sentence into a similarity value.
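The four steps of claim 3 resemble a search over a partitioned (inverted-file style) vector index. The sketch below is written under that assumption: the centroid-based entries, the Euclidean distance, the `n_probe` parameter for adjacent entries and the conversion `1 / (1 + d)` are illustrative choices, none of which is specified by the claim.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_index(vectors, centroids):
    """Assign each (sentence_id, vector) pair to the entry of its nearest centroid."""
    entries = {i: [] for i in range(len(centroids))}
    for sentence_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean(vec, centroids[i]))
        entries[nearest].append((sentence_id, vec))
    return entries


def search(query, centroids, entries, n_probe=2, top_k=3):
    """Return up to top_k (sentence_id, similarity) pairs for the query vector."""
    # Step 1: locate the query's entry and its adjacent entries
    # (here: the n_probe centroids closest to the query).
    order = sorted(range(len(centroids)),
                   key=lambda i: euclidean(query, centroids[i]))
    probed = order[:n_probe]
    # Step 2: distance from the query to every statement vector in those entries.
    distances = [(euclidean(query, vec), sid)
                 for i in probed for sid, vec in entries[i]]
    # Step 3: sort ascending and keep the preset number of closest sentences.
    distances.sort()
    # Step 4: convert each distance into a similarity value in (0, 1].
    return [(sid, 1.0 / (1.0 + d)) for d, sid in distances[:top_k]]
```

Probing only a few entries trades some recall for speed, which is the usual reason for using an index of this kind instead of an exhaustive scan.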
4. The method of claim 2, wherein vectorizing at least one of the search sentences to obtain a search sentence vector comprises:
performing word segmentation on the retrieval sentence according to a preset word segmentation method, and vectorizing each segmented word to obtain word vectors;
and performing a weighted summation of the word vectors, using the term frequency of each segmented word in the technical background text and its inverse document frequency as weights, to obtain the retrieval statement vector.
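The weighting described in claim 4 can be sketched as a TF-IDF weighted sum of word vectors. The whitespace tokenizer, the toy `embeddings` table, the reference `corpus` and the smoothed IDF formula `log(N / (1 + df)) + 1` are assumptions made for illustration; Chinese text would need a real word segmenter.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical segmenter: lower-case and split on whitespace."""
    return text.lower().split()


def sentence_vector(sentence, background_text, corpus, embeddings):
    """Weighted sum of word vectors with TF (in the background text) times IDF."""
    tf = Counter(tokenize(background_text))
    total = sum(tf.values()) or 1          # avoid division by zero on empty text
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for token in tokenize(sentence):
        if token not in embeddings:
            continue                       # skip words without a vector
        # Document frequency of the token in the reference corpus.
        df = sum(1 for doc in corpus if token in tokenize(doc))
        idf = math.log(len(corpus) / (1 + df)) + 1.0
        weight = (tf[token] / total) * idf
        vec = [v + weight * e for v, e in zip(vec, embeddings[token])]
    return vec
```

Words that are frequent in the technical background text but rare across the corpus dominate the resulting vector, which is the usual motivation for this weighting.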
5. The method of claim 1, wherein adding the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text as reference texts similar to the possibly non-innovative paragraph text comprises:
and taking similar paragraph texts and/or similar sentences with similarity values higher than a similarity threshold value as reference texts similar to the possibly non-innovative paragraph texts, wherein the reference texts are associated with the possibly non-innovative paragraph texts in any one or more ways of the following ways and are added into the technical background texts:
annotating mode, labeling mode, revising mode and annotation mode.
6. The method as claimed in any one of claims 1 to 5, wherein the determining at least one paragraph text that may not be innovative in the technical background text comprises:
judging whether any paragraph text in the technical background text carries a mark indicating that it is not innovative;
when such a mark is present, determining the paragraph texts so marked as the paragraph texts that may not be innovative;
and when no such mark is present, performing semantic analysis on all paragraph texts in the technical background text, and determining the paragraph texts that may not be innovative according to the analysis result.
7. The method as claimed in claim 6, wherein the semantic analysis of all the paragraph texts in the technical background text and the determination of the paragraph texts which may not be innovative by the analysis result comprise:
comparing all paragraph texts in the technical background text with the background art text in the technical background text, and if a paragraph text does not contain a technical effect statement text from a preset language library, determining that paragraph text as a paragraph text that may not be innovative; or
searching all paragraph texts in the technical background text for paragraph texts containing a technical effect statement text, and if a paragraph text does not contain a technical effect statement text from the preset language library, determining that paragraph text as a paragraph text that may not be innovative; or
comparing the paragraph texts in the technical background texts with the paragraph texts in a preset database, determining the similarity between the paragraph texts in the technical background and the paragraph texts in the database, and determining the paragraph texts with the similarity higher than a preset threshold as the paragraph texts which may not have innovation.
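The decision logic of claims 6 and 7 (use explicit marks when present, otherwise fall back to looking for technical effect statements) can be sketched as follows. The marker string `[prior-art]` and the three phrases in `EFFECT_PHRASES` are hypothetical placeholders for the mark and the preset language library, which the claims do not enumerate.

```python
MARKER = "[prior-art]"  # hypothetical mark for a non-innovative paragraph

# Hypothetical stand-in for the preset library of technical effect statements.
EFFECT_PHRASES = ("thereby improving", "so as to reduce", "which increases")


def possibly_non_innovative(paragraphs):
    """Return the paragraphs that may not be innovative."""
    # Case 1: at least one paragraph carries the mark, so trust the marks.
    marked = [p for p in paragraphs if MARKER in p]
    if marked:
        return marked
    # Case 2: no marks, so flag paragraphs that state no technical effect.
    return [p for p in paragraphs
            if not any(phrase in p.lower() for phrase in EFFECT_PHRASES)]
```

Plain substring matching is only the simplest possible form of the semantic analysis mentioned in the claims; the third alternative of claim 7 (similarity against a preset database) is not covered by this sketch.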
8. A method for expanding a technical background text, comprising:
determining at least one paragraph text in the technical background text that may not be innovative;
searching the keywords in at least one paragraph text which possibly does not have innovation according to a preset searching mode to determine core keywords;
comparing the correlation value between the core keyword and each other keyword with a preset correlation threshold, and determining the other keywords whose correlation values are larger than the preset correlation threshold as extension keywords; or arranging the other keywords in descending order of their similarity to the core keyword, and determining a preset number of the other keywords in the sorted result as extension keywords;
constructing a target retrieval feature sequence of the paragraph text which possibly has no innovation according to the core keywords and the extension keywords;
taking the target retrieval characteristic sequence as a retrieval object, retrieving in a pre-established relational database, and calculating the similarity value of the target retrieval characteristic sequence and the retrieval characteristic sequence in the pre-established relational database;
and comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences with the similarity values higher than the similarity threshold into the technical background texts as reference texts similar to the paragraph texts which possibly do not have innovativeness.
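The keyword expansion and sequence comparison of claim 8 can be illustrated with a small sketch. The precomputed `correlations` table, the function names and the use of Jaccard overlap as the similarity between feature sequences are assumptions; the claim leaves the correlation measure and the sequence similarity unspecified.

```python
def extension_keywords(correlations, threshold=None, top_n=None):
    """Select extension keywords by threshold or by rank.

    correlations: {keyword: correlation with the core keyword} (hypothetical input).
    If threshold is given, keep keywords strictly above it; otherwise keep the
    top_n highest-ranked keywords (all of them when top_n is None).
    """
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        return [kw for kw, score in ranked if score > threshold]
    return [kw for kw, _ in ranked[:top_n]]


def retrieval_feature_sequence(core, correlations, threshold=None, top_n=None):
    """Target retrieval feature sequence: core keyword followed by its extensions."""
    return [core] + extension_keywords(correlations, threshold, top_n)


def jaccard(seq_a, seq_b):
    """Illustrative similarity between two feature sequences, in [0, 1]."""
    a, b = set(seq_a), set(seq_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

A stored feature sequence whose similarity to the target sequence exceeds the preset threshold would then contribute its paragraph or sentence as a reference text.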
9. A device for extending technical background text, comprising:
a determining module for determining at least one paragraph text in the technical background text that may not be innovative;
the retrieval module is used for retrieving in a pre-established retrieval database by taking at least one retrieval statement contained in at least one possibly non-innovative paragraph text as a retrieval object, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;
and the adding module is used for comparing the similarity value with a preset similarity threshold value, and adding the similar paragraph text and/or the similar sentence with the similarity value higher than the similarity threshold value into the technical background text as a reference text similar to the paragraph text which possibly does not have innovation.
10. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the method for expanding a technical background text as claimed in any one of claims 1 to 8.
11. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to implement the method of extending technical background text as claimed in any one of claims 1 to 8 when executing the program.
CN202010420142.9A 2020-03-19 2020-05-18 Method, device and equipment for expanding technical background text Pending CN111753066A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010196520 2020-03-19
CN202010196520X 2020-03-19

Publications (1)

Publication Number Publication Date
CN111753066A true CN111753066A (en) 2020-10-09

Family

ID=72673235

Family Applications (6)

Application Number Title Priority Date Filing Date
CN202010421277.7A Pending CN111753514A (en) 2020-03-19 2020-05-18 Automatic generation method and device of patent application text
CN202010420151.8A Active CN111756689B (en) 2020-03-19 2020-05-18 System and method for generating patent application file
CN202010421279.6A Pending CN111753067A (en) 2020-03-19 2020-05-18 Innovative assessment method, device and equipment for technical background text
CN202010420143.3A Pending CN111753535A (en) 2020-03-19 2020-05-18 Method and device for generating patent application text
CN202010420142.9A Pending CN111753066A (en) 2020-03-19 2020-05-18 Method, device and equipment for expanding technical background text
CN202010421278.1A Pending CN111753536A (en) 2020-03-19 2020-05-18 Automatic patent application text writing method and device

Family Applications Before (4)

Application Number Title Priority Date Filing Date
CN202010421277.7A Pending CN111753514A (en) 2020-03-19 2020-05-18 Automatic generation method and device of patent application text
CN202010420151.8A Active CN111756689B (en) 2020-03-19 2020-05-18 System and method for generating patent application file
CN202010421279.6A Pending CN111753067A (en) 2020-03-19 2020-05-18 Innovative assessment method, device and equipment for technical background text
CN202010420143.3A Pending CN111753535A (en) 2020-03-19 2020-05-18 Method and device for generating patent application text

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010421278.1A Pending CN111753536A (en) 2020-03-19 2020-05-18 Automatic patent application text writing method and device

Country Status (1)

Country Link
CN (6) CN111753514A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686639B (en) * 2021-01-05 2022-11-08 河北冀联人力资源服务集团有限公司 Labor contract determination method and system based on deep learning
CN116010603A (en) * 2023-01-31 2023-04-25 浙江中电远为科技有限公司 Feature clustering dimension reduction method for commercial text classification

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528836A (en) * 2016-11-22 2017-03-22 北京恒冠网络数据处理有限公司 Method and device for compiling patent background technology based on big data
CN109766429A (en) * 2019-02-19 2019-05-17 北京奇艺世纪科技有限公司 A kind of sentence retrieval method and device

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041739B2 (en) * 2001-08-31 2011-10-18 Jinan Glasgow Automated system and method for patent drafting and technology assessment
US7707039B2 (en) * 2004-02-15 2010-04-27 Exbiblio B.V. Automatic modification of web pages
US20170098290A1 (en) * 2005-12-14 2017-04-06 Harold W. Milton, Jr. System for preparing a patent application
TWI464601B (en) * 2006-12-22 2014-12-11 Hon Hai Prec Ind Co Ltd System and method for creating patent application files
CN101488164A (en) * 2008-10-10 2009-07-22 亿维讯软件(北京)有限公司 Method for generating patent application files related to invention creation
CN106155989A (en) * 2015-04-03 2016-11-23 北京中知智慧科技有限公司 Patent document generates method and apparatus
CN104809106A (en) * 2015-05-15 2015-07-29 合肥汇众知识产权管理有限公司 System and method for excavating patent schemes
CN105956955A (en) * 2016-05-06 2016-09-21 长沙市麓智信息科技有限公司 Case tracking interaction system and method
CN105956119A (en) * 2016-05-06 2016-09-21 长沙市麓智信息科技有限公司 Patent write auxiliary system and method
CN105930316A (en) * 2016-05-06 2016-09-07 长沙市麓智信息科技有限公司 Patent writing assistance system and assistance method therefor
CN106021207A (en) * 2016-05-06 2016-10-12 长沙市麓智信息科技有限公司 A patent writing system and method
CN106777193B (en) * 2016-12-23 2020-04-10 李鹏 Method for automatically writing specific manuscript
CN106776519A (en) * 2016-12-26 2017-05-31 北京文先科技有限公司 A kind of self-service methodology of composition of patent and system
CN106940726B (en) * 2017-03-22 2020-09-01 山东大学 Creative automatic generation method and terminal based on knowledge network
CN107133210A (en) * 2017-04-20 2017-09-05 中国科学院上海高等研究院 Scheme document creation method and system
CN107220295B (en) * 2017-04-27 2020-02-07 银江股份有限公司 Searching and mediating strategy recommendation method for human-human contradiction mediating case
CN108416008A (en) * 2018-02-28 2018-08-17 华南理工大学 A kind of BIM product database semantic retrieving methods based on natural language processing
CN108491384A (en) * 2018-03-15 2018-09-04 周慧祥 A kind of auxiliary writing system of patent application document
CN109062877A (en) * 2018-04-24 2018-12-21 筑权网(武汉)科技有限公司 A kind of self-service methodology of composition of patent and system
CN108763486A (en) * 2018-05-30 2018-11-06 湖南写邦科技有限公司 Paper duplicate checking method, terminal and storage medium based on terminal
CN109062937B (en) * 2018-06-15 2019-11-26 北京百度网讯科技有限公司 The method of training description text generation model, the method and device for generating description text
CN108845991A (en) * 2018-06-28 2018-11-20 河北国瑞企业管理咨询有限公司 A kind of intra-company's patent duplicate checking method
CN108932220A (en) * 2018-06-29 2018-12-04 北京百度网讯科技有限公司 article generation method and device
CN109101538A (en) * 2018-06-29 2018-12-28 中译语通科技股份有限公司 A kind of entity abstracting method and system towards Chinese patent text
CN109522537A (en) * 2018-11-16 2019-03-26 合肥汇创知识产权代理有限公司 Patent writing and application software for XRF analysis
CN109635284A (en) * 2018-11-26 2019-04-16 北京邮电大学 Text snippet method and system based on deep learning associate cumulation attention mechanism
CN109376350A (en) * 2018-12-15 2019-02-22 长沙贤正益祥机械科技有限公司 A kind of semi-automatic methodology of composition of structure class product patent, server and system
CN109766537A (en) * 2019-01-16 2019-05-17 北京未名复众科技有限公司 Study abroad document methodology of composition, device and electronic equipment
CN110413986B (en) * 2019-04-12 2023-08-29 上海晏鼠计算机技术股份有限公司 Text clustering multi-document automatic summarization method and system for improving word vector model
CN110502632A (en) * 2019-07-19 2019-11-26 平安科技(深圳)有限公司 Contract terms reviewing method, device, computer equipment and storage medium based on clustering algorithm
CN110457690A (en) * 2019-07-26 2019-11-15 南京邮电大学 A kind of judgment method of patent creativeness
CN110427884B (en) * 2019-08-01 2023-05-09 达而观信息科技(上海)有限公司 Method, device, equipment and storage medium for identifying document chapter structure
CN110532352B (en) * 2019-08-20 2023-10-27 腾讯科技(深圳)有限公司 Text duplication checking method and device, computer readable storage medium and electronic equipment
CN111160870A (en) * 2019-12-31 2020-05-15 洪泰智造(青岛)信息技术有限公司 Patent file generation method, device and system and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CSDN: "技术交底书在专利申请文件撰写中的功用", 《HTTPS://BLOG.CSDN.NET/TSEMAEK/ARTICLE/DETAILS/51181049》, 18 April 2016 (2016-04-18), pages 2 - 3 *
沈泳;: "专利文件撰写的完全解构", 中国发明与专利, no. 02, 16 February 2008 (2008-02-16), pages 46 - 49 *

Also Published As

Publication number Publication date
CN111753514A (en) 2020-10-09
CN111753067A (en) 2020-10-09
CN111753535A (en) 2020-10-09
CN111756689A (en) 2020-10-09
CN111756689B (en) 2022-11-22
CN111753536A (en) 2020-10-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination