CN111753066A

CN111753066A - Method, device and equipment for expanding technical background text

Info

Publication number: CN111753066A
Application number: CN202010420142.9A
Authority: CN
Inventors: 刘恺; 张灏; 李强
Original assignee: Beijing Xinju Intellectual Property Co ltd
Current assignee: Beijing Xinju Intellectual Property Co ltd
Priority date: 2020-03-19
Filing date: 2020-05-18
Publication date: 2020-10-09
Also published as: CN111753514A; CN111753067A; CN111753535A; CN111756689A; CN111756689B; CN111753536A

Abstract

The invention discloses a method, a device and equipment for expanding a technical background text. The method comprises: determining at least one paragraph text in the technical background text that may not be innovative; taking at least one retrieval statement contained in that paragraph text as a retrieval object and retrieving in a pre-established retrieval database; determining similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval statement; and adding the similar paragraph texts and/or similar sentences whose similarity values are higher than a similarity threshold to the technical background text as reference texts similar to the paragraph text that may not be innovative. Compared with the prior-art approach of expanding embodiment text by manual writing, the invention improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the personnel concerned, and thereby improves writing quality and efficiency.

Description

Method, device and equipment for expanding technical background text

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, an apparatus, and a device for expanding a technical background text.

Background

Patent documents are the largest technical information resource in the world; according to statistics, they contain 90% to 95% of the world's scientific and technical information, and as intangible property they are drawing more and more attention. For example, in 2019 the number of invention patent applications in China was 1.401 million, and the number of invention patents granted was 453,000. By the end of December 2019, however, there were only 2,649 patent agencies nationwide and just over 20,000 practicing patent agents, even though these figures had grown 1.9 times and 1.5 times, respectively, compared with 2012. Given the gap between the number of patent applications and the number of patent agents, many applicants still cannot efficiently submit the inventions produced during research and development to the relevant departments as patent applications.

Although ordinary applicants, especially inventors, understand the technical solution of their invention clearly, they are usually unfamiliar with the rules and requirements of patent drafting and find it difficult to write a qualified application document independently. At present there is no practical auxiliary means to help an applicant who understands the technical solution but lacks patent application experience to form a preliminary application document, nor any means to help inexperienced applicants and inventors quickly build an understanding of patent drafting and grasp its basics.

The main problems in practice are therefore as follows: patent applicants do not know how to write technical background texts, and no dedicated patent writing system provides them with convenient patent writing services. Paragraph texts that may not be innovative are important source material for generating the specific embodiments, so expanding these paragraph texts into embodiment content is particularly important. How to expand the technical background text so as to facilitate the intelligent generation of patent application documents is therefore a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of the above, the present invention has been made to provide a technical background text extension method, apparatus and device that overcome or at least partially solve the above problems.

In a first aspect, an embodiment of the present invention provides a method for extending a technical background text, which may include:

determining at least one paragraph text in the technical background text that may not be innovative;

taking at least one retrieval statement contained in at least one paragraph text which possibly does not have innovation as a retrieval object, retrieving in a pre-established retrieval database, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;

and comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences with the similarity values higher than the similarity threshold into the technical background texts as reference texts similar to the paragraph texts which possibly do not have innovativeness.

Optionally, before at least one retrieval statement contained in the paragraph text that may not be innovative is taken as a retrieval object and retrieval is performed in the pre-established retrieval database, the method further includes:

vectorizing at least one retrieval statement to obtain a retrieval statement vector;

taking at least one retrieval statement contained in the paragraph text which may not have innovativeness as a retrieval object, retrieving in a pre-established retrieval database, and determining similarity values of the retrieved similar statements and/or similar paragraph text and the retrieval statement, including:

and taking the retrieval statement vector as a retrieval object, retrieving in a pre-established retrieval database to obtain similar statements and/or similar paragraph texts, and determining similarity values of the similar statements and/or similar paragraph texts and the retrieval statement according to the calculated similarity distance between the similar statements and/or similar paragraph texts and the retrieval statement.

Optionally, taking the search statement vector as a search object, searching in a pre-established search database, and determining a similarity value between the similar statement and/or the similar paragraph text and the search statement, includes:

determining the entry of the retrieval statement vector in the retrieval database according to a preset index mode by taking the retrieval statement vector as the input of a retrieval object;

calculating similarity distances between all statement vectors in the entries and the adjacent entries and the retrieval statement vector;

sequencing the obtained similarity distances from small to large, and acquiring similarity distances corresponding to a preset number of similar sentences with small similarity distances in a sequencing result;

and converting the similarity distance between the similar sentence and/or the similar paragraph text and the retrieval sentence into a similarity value.

Optionally, vectorizing at least one of the search statements to obtain a search statement vector, where the vectorizing includes:

performing word segmentation processing on the retrieval sentence according to a preset word segmentation method, and performing vectorization processing on the word segmentation to obtain a word segmentation vector;

and performing a weighted summation of the word segmentation vectors, using the word frequency of each segmented word in the technical background text and its inverse document frequency as weights, to obtain the retrieval statement vector.

Optionally, adding similar paragraph texts and/or similar sentences with similarity values higher than the similarity threshold as reference texts similar to the paragraph texts which may not be innovative to the technical background text, including:

and taking similar paragraph texts and/or similar sentences with similarity values higher than a similarity threshold value as reference texts similar to the possibly non-innovative paragraph texts, wherein the reference texts are associated with the possibly non-innovative paragraph texts in any one or more ways of the following ways and are added into the technical background texts:

a comment mode, a marking mode, a revision mode and an annotation mode.

Optionally, the determining at least one paragraph text that may not be innovative in the technical background text includes:

judging whether any paragraph text in the technical background text is marked with an identifier indicating a lack of innovativeness;

when such an identifier is present, determining the paragraph texts marked with it as the paragraph texts that may not be innovative;

and when the mark is not included, performing semantic analysis on all paragraph texts in the technical background text, and determining the paragraph texts which possibly do not have innovativeness according to the analysis result.

Optionally, the semantic analysis is performed on all the paragraph texts in the technical background text, and the paragraph texts which may not have innovativeness are determined through an analysis result, including:

comparing all paragraph texts in the technical background text with the background art text in the technical background text, and if a paragraph text does not contain any technical effect statement text from a preset language library, determining that paragraph text as a paragraph text that may not be innovative; or

searching all paragraph texts in the technical background text for paragraph texts containing technical effect statement text, and if a paragraph text does not contain any technical effect statement text from the preset language library, determining that paragraph text as a paragraph text that may not be innovative; or

comparing the paragraph texts in the technical background text with the paragraph texts in a preset database, determining the similarity between them, and determining the paragraph texts whose similarity is higher than a preset threshold as the paragraph texts that may not be innovative.

In a second aspect, an embodiment of the present invention provides a device for expanding technical background text, which may include:

a determining module for determining at least one paragraph text in the technical background text that may not be innovative;

the retrieval module is used for retrieving in a pre-established retrieval database by taking at least one retrieval statement contained in at least one possibly non-innovative paragraph text as a retrieval object, and determining the similarity value of the retrieved similar statement and/or similar paragraph text and the retrieval statement;

and the adding module is used for comparing the similarity value with a preset similarity threshold value, and adding the similar paragraph text and/or the similar sentence with the similarity value higher than the similarity threshold value into the technical background text as a reference text similar to the paragraph text which possibly does not have innovation.

In a third aspect, an embodiment of the present invention provides another method for expanding a technical background text, where the method may include:

searching, according to a preset search mode, among the keywords contained in at least one paragraph text that may not be innovative, to determine core keywords;

comparing the correlation value between the core keyword and each other keyword with a preset correlation threshold, and determining the other keywords whose correlation value is greater than the preset correlation threshold as extension keywords; or sorting the other keywords in descending order of their similarity to the core keyword, and determining a preset number of other keywords in the sorting result as extension keywords;

constructing a target retrieval feature sequence of the paragraph text which possibly has no innovation according to the core keywords and the extension keywords;

taking the target retrieval characteristic sequence as a retrieval object, retrieving in a pre-established relational database, and calculating the similarity value of the target retrieval characteristic sequence and the retrieval characteristic sequence in the pre-established relational database;
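The keyword expansion and feature-sequence construction described above can be sketched as follows. This is only an illustrative sketch: the function names, the dictionary of correlation values, and the choice of a plain ordered list as the "target retrieval feature sequence" are all assumptions, since the text does not specify these details.

```python
from typing import Dict, List, Optional


def expand_keywords(
    correlations: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Select extension keywords for one core keyword.

    `correlations` maps each other keyword to its (hypothetical)
    correlation value with the core keyword.
    """
    # Rank the other keywords from most to least correlated.
    ranked = sorted(correlations, key=lambda k: correlations[k], reverse=True)
    if threshold is not None:
        # Variant 1: keep every keyword whose value exceeds the threshold.
        return [k for k in ranked if correlations[k] > threshold]
    if top_n is not None:
        # Variant 2: keep a preset number of the best-ranked keywords.
        return ranked[:top_n]
    raise ValueError("give either threshold or top_n")


def build_feature_sequence(core: str, extensions: List[str]) -> List[str]:
    """Assumed layout: core keyword first, then extensions, no duplicates."""
    sequence = [core]
    for word in extensions:
        if word not in sequence:
            sequence.append(word)
    return sequence
```

Either selection variant yields a list that can be passed to `build_feature_sequence`; which variant is used is a configuration choice.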

In a fourth aspect, an embodiment of the present invention provides another technical background text extension apparatus, which may include:

a first determining module for determining at least one paragraph text in the technical background text that may not be innovative;

the second determining module is used for searching, according to a preset search mode, among the keywords contained in at least one paragraph text that may not be innovative, to determine core keywords;

a third determining module, configured to compare a relevance value between the core keyword and another keyword with a preset relevance threshold, and determine another keyword greater than the preset relevance threshold as an extended keyword;

the building module is used for building a target retrieval feature sequence of the paragraph text that may not be innovative according to the core keywords and the extension keywords;

the retrieval module is used for retrieving in a pre-established relational database by taking the target retrieval characteristic sequence as a retrieval object and calculating the similarity value of the target retrieval characteristic sequence and the retrieval characteristic sequence in the pre-established relational database;

In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the above-mentioned technical background text extension method.

In a sixth aspect, an embodiment of the present invention provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program and is operable to implement the above-mentioned text extension method.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

The method and device provided by the embodiment of the invention determine at least one paragraph text in the technical background text that may not be innovative, then take at least one retrieval statement contained in that paragraph text as a retrieval object and retrieve in a pre-established retrieval database, determine the similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval statement, and add the similar paragraph texts and/or similar sentences whose similarity values are higher than a similarity threshold to the technical background text as reference texts similar to the paragraph text that may not be innovative. Compared with the prior-art approach of expanding embodiment text by manual writing, the embodiment of the invention improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the personnel concerned, and thereby improves writing quality and efficiency.

Furthermore, because each retrieval statement in the paragraph text that may not be innovative is used as a retrieval object, the retrieval range is greatly expanded and the resulting reference texts are richer: a large amount of descriptive text is provided from which similar embodiment texts can be formed, giving the content of the technical background text a large pool of reference texts to screen.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

fig. 1 is a flowchart of a method for expanding a technical background text according to embodiment 1 of the present invention;

fig. 2 is a flowchart of vectorization processing performed on a search statement according to embodiment 1 of the present invention;

FIG. 3 is a flowchart of a specific implementation method of step S12;

fig. 4 is a flowchart of a method for constructing a search database according to embodiment 1 of the present invention;

FIG. 5 is a flowchart of a specific implementation method of step S11;

fig. 6 is a schematic structural diagram of a technical background text expansion device according to embodiment 1 of the present invention;

FIG. 7 is a flowchart illustrating another method for expanding a text according to another embodiment of the present invention;

fig. 8 is a schematic structural diagram of another technical background text expansion apparatus according to embodiment 2 of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to solve the problem that description of an existing similar embodiment or an existing implementation mode is not complete enough when a patent is written manually in the prior art, the embodiment of the invention provides a method, a device and equipment for expanding a technical background text, which can expand the content of an embodiment based on the existing technical background text, complete intelligent generation of the text part of an embodiment of a specification, provide a reference text for relevant personnel, save a large amount of labor cost and improve the efficiency and quality of patent writing.

Example 1

An embodiment 1 of the present invention provides a method for extending a technical background text, which, referring to fig. 1, includes the following steps:

Step S11, determining at least one paragraph text in the technical background text that may not be innovative.

Generally, not all paragraph texts in a technical background text are innovative: some describe or explain technical terms used in the invention, some explain the implementation logic and implementation modes of the invention, and others highlight the differences from the prior art or the beneficial effects and advances of the described solution.

It should be noted that the paragraph texts that may not be innovative in the embodiment of the present invention are not necessarily non-innovative; they may be paragraph texts that the personnel concerned regard as non-innovative, or that a rough analysis regards as non-innovative. The present application analyzes these paragraph texts again in order to expand them intelligently.

Step S12, using at least one search sentence contained in at least one paragraph text that may not have innovativeness as a search object, performing search in a pre-established search database, and determining the similarity value between the searched similar sentences and/or similar paragraph texts and the search sentences.

Step S13, comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold to the technical background text as reference texts similar to the paragraph texts that may not be innovative.

In an alternative embodiment, before step S12 is performed, at least one of the retrieval statements needs to be vectorized to obtain a retrieval statement vector. Correspondingly, the retrieval statement vector is used as the retrieval object for retrieval in the pre-established retrieval database, and the similarity value between the retrieved similar sentences and/or similar paragraph texts and the retrieval statement is determined.

The embodiment of the invention can quickly search in the pre-established search database by vectorizing the search sentences so as to obtain the similar sentences and/or the similar paragraph texts, thereby accelerating the search efficiency and improving the search accuracy.

In an alternative embodiment, the vectorization of the retrieval statement is implemented as follows. As an example, the embodiment of the present invention provides the following statements:

statement A: "a correlation operation circuit in pseudo random code ranging.";

statement B: "its typical application is to use an m-sequence as a pseudo-random code.";

statement C: "connecting device for USB interface and storage device of storage device.";

statement D: "wherein the interface connector is designed as the above-mentioned connecting device.";

statement E: "invoke server for scheduled videoconference."

The detailed process of vectorizing the search statement in the embodiment of the present invention is as follows, and as shown in fig. 2, the detailed process may include:

Step S21, performing word segmentation on the retrieval statement according to a preset word segmentation method, and vectorizing the segmented words to obtain word segmentation vectors.

Step S22, performing a weighted summation of the word segmentation vectors, using the word frequency of each segmented word in the technical background text and its inverse document frequency as weights, to obtain the retrieval statement vector.

The preset word segmentation method may be any existing word segmentation method, such as a string-matching word segmentation algorithm or a statistics-based machine learning algorithm. After the segmented words are obtained, each is vectorized to obtain its word segmentation vector, for example with a word-to-vector method: the embodiment of the present invention uses the FastText word vector model, taking the words in all technical background texts as training input and outputting the word segmentation vector of each segmented word.

Specifically, taking statement A above as an example, its segmented words are: "a", "at", "pseudo", "random code", "ranging", "in", "correlated", "operating", "circuit". The word vector of "circuit" may be expressed as [-0.0529, -0.2667, ……, -0.0355, 0.0803]. In this embodiment the vector dimension can be set according to actual requirements, for example 256 dimensions.

After the word segmentation vectors are obtained, the statement vector is obtained as a weighted sum of the word segmentation vectors, weighted by the word frequency of each segmented word in the technical background text and its inverse document frequency. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining to evaluate how important a term is to one document within a document set or corpus (e.g., the technical background texts).

[Formula: term frequency (TF) of a segmented word in the technical background text]

[Formula: inverse document frequency (IDF) of a segmented word in the technical background text]
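The two formulas are not reproduced in this text. A common textbook form, given here as an assumption rather than as the definition used by the invention, is:

```latex
\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{IDF}(w) = \log \frac{|D|}{1 + \left|\{\, d \in D : w \in d \,\}\right|}, \qquad
\mathrm{TF\text{-}IDF}(w, d) = \mathrm{TF}(w, d) \cdot \mathrm{IDF}(w)
```

Here $n_{w,d}$ is the number of occurrences of the segmented word $w$ in text $d$, $D$ is the set of technical background texts, and the $1$ in the denominator is a smoothing term (also an assumption) that avoids division by zero.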

Taking statement A as an example again: the word vector of "a" is denoted V_{a} and its term frequency-inverse document frequency weight TF-IDF_{a}; the word vector of "at" is denoted V_{at} and its weight TF-IDF_{at}; ……; the word vector of "circuit" is denoted V_{circuit} and its weight TF-IDF_{circuit}. The statement vector of statement A is then V_{a}*TF-IDF_{a} + V_{at}*TF-IDF_{at} + …… + V_{circuit}*TF-IDF_{circuit}. The statement vector in the embodiment of the present invention may use the same dimension as the word vectors, for example 256 dimensions.
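The weighted sum for statement A can be illustrated with a minimal sketch. The function names, the toy two-dimensional vectors, and the smoothed IDF form are assumptions for illustration; a real system would use trained FastText vectors of, say, 256 dimensions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """TF-IDF weight of each segmented word of one statement."""
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumption): the +1 avoids division by zero.
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log(len(corpus) / (1 + docs_with_word))
        weights[word] = tf * idf
    return weights


def sentence_vector(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted sum V_w * TF-IDF_w, one term per segmented word."""
    dim = len(next(iter(word_vectors.values())))
    result = [0.0] * dim
    for word in tokens:
        vector = word_vectors[word]
        weight = weights[word]
        for i in range(dim):
            result[i] += vector[i] * weight
    return result
```

The result has the same dimension as the word vectors, which is why the statement vector and the word vectors can share one dimension setting.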

In an alternative embodiment, the above-mentioned taking the search statement vector as a search object, searching in a pre-established search database, and determining the similarity value between the similar statement and/or the similar paragraph text and the search statement, as shown in fig. 3, may include the following steps:

Step S31, taking the retrieval statement vector as the input retrieval object, and determining the entry of the retrieval statement vector in the retrieval database according to a preset index mode.

The retrieval database in the embodiment of the present invention comprises a database built by training on a preset corpus and a vector index library; the specific construction method is described below. The retrieval database may be built on a relational database management system such as MySQL; of course, other types of databases may also be used.

The database includes a plurality of entries, each comprising: the sentence original text, the statement vector corresponding to the sentence original text, and the full-text number of the sentence original text. The sentence original text is stored so that it can be conveniently extracted for reference or use; the statement vector is used to calculate the similarity distance between the sentence and the retrieval statement; and the full-text number is used for ordering, indexing, and similar operations over all sentences in the database.

Step S32, calculating the similarity distances between the retrieval statement vector and all statement vectors in the entry and in the adjacent entries.

In the embodiment of the invention, the similarity distance between the statement vectors and the retrieval statement vector in all the entries and the adjacent entries is calculated by using the conventional distance calculation method. For example, the similarity distance is calculated by using the euclidean distance, which is not particularly limited in the embodiment of the present invention. The smaller the similarity distance, the higher the similarity between the sentence original text and the search sentence in the database.

Step S33, sorting the obtained similarity distances in ascending order, and taking the similarity distances of a preset number of similar sentences with the smallest similarity distances in the sorted result.

Step S34, converting the similarity distance between the similar sentence and/or similar paragraph text and the retrieval statement into a similarity value.

In the embodiment of the invention, the similarity value between each similar sentence and/or similar paragraph and the retrieval statement is determined through the above retrieval and is then compared with the preset similarity threshold. For example, with a similarity threshold of 65% and a preset number of 5, if the similarity values between the 5 similar sentences and the retrieval statement are all greater than 65%, those sentences are determined to be reference texts of the paragraph text.
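Steps S32 to S34 and the threshold comparison can be sketched as a brute-force search. The conversion from distance to similarity, `1 / (1 + distance)`, is an assumption made for illustration only, as the text does not state which conversion is used; the function names are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(distance: float) -> float:
    # Assumed conversion: 1.0 at distance 0, decreasing toward 0.
    return 1.0 / (1.0 + distance)


def retrieve_references(
    query: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    top_k: int = 5,
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Return (sentence, similarity) pairs that pass the threshold."""
    # S32: distance from the query to every candidate statement vector.
    scored = [(euclidean(query, vec), text) for text, vec in candidates]
    # S33: sort ascending and keep the preset number of nearest sentences.
    scored.sort(key=lambda item: item[0])
    nearest = scored[:top_k]
    # S34: convert each distance into a similarity value.
    results = [(text, distance_to_similarity(d)) for d, text in nearest]
    # Threshold comparison: keep only sufficiently similar sentences.
    return [(text, sim) for text, sim in results if sim > threshold]
```

A brute-force scan is used here for clarity; the indexed search over entries and adjacent entries replaces it when the database is large.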

In an alternative embodiment, adding similar paragraph text and/or similar sentences with similarity values higher than the similarity threshold value to the technical background text as reference text similar to paragraph text which may not be innovative includes:

taking the similar paragraph texts and/or similar sentences whose similarity values are higher than the similarity threshold as reference texts similar to the paragraph texts that may not be innovative, associating the reference texts with those paragraph texts in any one or more of the following modes, and adding them to the technical background text:

a comment mode, a marking mode, a revision mode and an annotation mode.
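One possible way to represent the association between a paragraph text and its reference texts is sketched below. The class names, field names, and mode labels are hypothetical; the text only states that the reference texts are attached in one or more of the listed modes.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels for the four association modes.
ALLOWED_MODES = {"comment", "marking", "revision", "annotation"}


@dataclass
class Reference:
    text: str
    similarity: float
    mode: str


@dataclass
class Paragraph:
    text: str
    references: List[Reference] = field(default_factory=list)


def attach_reference(
    paragraph: Paragraph, text: str, similarity: float, mode: str = "annotation"
) -> Paragraph:
    """Associate one reference text with a paragraph in the given mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown association mode: {mode}")
    paragraph.references.append(Reference(text, similarity, mode))
    return paragraph
```

Keeping the references next to the paragraph, rather than merging them into its text, leaves the original paragraph unchanged for the drafter to review.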

By intelligently adding similar sentences and/or similar paragraphs, the embodiment of the invention provides references for the personnel concerned, so that paragraph texts that may not be innovative can be further expanded and both the quality and the efficiency of the work are improved.

In a specific embodiment, the retrieval database may be built in advance from a corpus consisting of a large number of existing publications, such as patent documents, papers, and periodicals, by constructing a database and a vector index library for the corpus. The specific construction method, shown in fig. 4, may include the following steps:

Step S41, performing word segmentation on the sentences in the preset corpus using a preset word segmentation method, and vectorizing the segmented words to obtain all word segmentation vectors.

Step S42, performing a weighted summation of the word segmentation vectors, using the word frequency of each segmented word in the preset corpus and its inverse document frequency as weights, to obtain the statement vectors.

In the embodiment of the present invention, specific implementations of the step S41 and the step S42 may refer to examples and descriptions related to the step S21 and the step S22, and are not described herein again.

Step S43, storing the statement vector, the sentence original text, and the full-text number corresponding to the sentence original text in a relational database.
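Step S43 can be sketched with Python's built-in `sqlite3` module standing in for the relational database (the text names MySQL as one option). The table name, column names, JSON encoding of the vector, and the use of the full-text number as primary key are all assumptions for illustration.

```python
import json
import sqlite3
from typing import List, Tuple


def create_store(conn: sqlite3.Connection) -> None:
    """Create a table holding one entry per sentence."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences ("
        " full_text_number INTEGER PRIMARY KEY,"
        " original_text TEXT NOT NULL,"
        " vector TEXT NOT NULL)"  # vector stored as a JSON array
    )


def save_sentence(
    conn: sqlite3.Connection, number: int, text: str, vector: List[float]
) -> None:
    """Store the original text, its vector, and its full-text number."""
    conn.execute(
        "INSERT INTO sentences (full_text_number, original_text, vector)"
        " VALUES (?, ?, ?)",
        (number, text, json.dumps(vector)),
    )


def load_sentence(conn: sqlite3.Connection, number: int) -> Tuple[str, List[float]]:
    """Fetch the original text and vector for a full-text number."""
    row = conn.execute(
        "SELECT original_text, vector FROM sentences WHERE full_text_number = ?",
        (number,),
    ).fetchone()
    if row is None:
        raise KeyError(number)
    return row[0], json.loads(row[1])
```

With this layout, the index library only needs to return full-text numbers; the original text is then looked up in the database.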

For example, the relational database that is maintained may be as shown in table 1 below:

TABLE 1

Step S44, constructing a vector index library of the sentences using a preset similar-text retrieval algorithm.

The embodiment of the invention uses an approximate nearest neighbor similar-text retrieval algorithm to construct the index library of the database. Examples include HNSW (Hierarchical NSW), a more recent method for approximate k-nearest-neighbor search that improves on the NSW method and is built from multiple layers of neighbor graphs, hence the name Hierarchical NSW, and Faiss (a clustering and similarity search library open-sourced by the Facebook AI team), among others. Faiss, which this embodiment adopts, is a framework that provides efficient similarity search over dense vectors; it supports searching vectors on the scale of hundreds of millions, retrieves quickly, and is currently one of the most mature approximate nearest neighbor search libraries.

The input of the algorithm is the vector matrix of the sentences in the database together with their full-text numbers. If the database holds 100,000 sentences and the vector dimension is 256, the input is a 100,000 × 256 vector matrix plus the full-text numbers of the corresponding sentences, and the retrieval index is obtained with a Faiss index method. Faiss offers several index methods. One is IndexIVFFlat, which defines a number of Voronoi cells in the d-dimensional (here 256-dimensional) space so that each statement vector in the database falls into one of the cells; IndexIVFFlat requires a training process, after which the Faiss retrieval index is obtained. The embodiment currently uses the Faiss 'PCA64,IVF1000,Flat' index method combined with the FastText vectors of the statements. In a retrieval test that uses the original sentences as queries, Recall@Top1 (the expected result is ranked first) is 99.7893%, Recall@Top2 is 99.8883%, and Recall@Top3 is 99.9863%.
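The Voronoi-cell idea behind an inverted-file index can be illustrated with a toy pure-Python sketch. It is not the Faiss API and makes several assumptions: the centroids are supplied directly (Faiss learns them during training), `nprobe` is borrowed as a parameter name for the number of cells scanned, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def build_ivf_index(
    vectors: Dict[int, Vector], centroids: List[Vector]
) -> Dict[int, List[int]]:
    """Assign each full-text number to the cell of its nearest centroid."""
    cells: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for number, vec in vectors.items():
        nearest = min(
            range(len(centroids)), key=lambda i: math.dist(vec, centroids[i])
        )
        cells[nearest].append(number)
    return cells


def search_ivf(
    query: Vector,
    vectors: Dict[int, Vector],
    centroids: List[Vector],
    cells: Dict[int, List[int]],
    nprobe: int = 2,
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Scan only the nprobe cells closest to the query."""
    # Find the cell the query falls into, plus its nearest neighbors.
    order = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )
    candidates = [n for i in order[:nprobe] for n in cells[i]]
    # Exact distances are computed only for vectors in the probed cells.
    scored = sorted((math.dist(query, vectors[n]), n) for n in candidates)
    return [(n, d) for d, n in scored[:top_k]]
```

Scanning the query's own cell and a few adjacent ones trades a small loss in recall for far fewer distance computations than a full scan.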

In an alternative embodiment, the implementation of step S11 can be as shown in fig. 5, and may include the following steps:

Step S51, judging whether any paragraph text in the technical background text is marked with a non-innovative identifier; if such an identifier is present, executing step S52; otherwise, executing step S53.

Step S52, determining the paragraph text marked with the non-innovative identifier as the paragraph text that may not be innovative.

Step S53, when no such identifier is present, performing semantic analysis on all paragraph texts in the technical background text, and determining the paragraph texts that may not be innovative according to the analysis result.

Specifically, the implementation of step S53 may include the following steps:

comparing each paragraph text in the technical background text with the background-art text in the technical background text, and, if a paragraph text does not contain any technical-effect sentence text from the preset corpus, determining that paragraph text as a paragraph text that may not be innovative; or

comparing the paragraph texts in the technical background text with the paragraph texts in a preset database, determining the similarity between them, and determining the paragraph texts whose similarity is higher than a preset threshold as the paragraph texts that may not be innovative.

It should be noted that, in the embodiment of the present invention, both the paragraph texts marked with the non-innovative identifier and the paragraph texts identified by semantic analysis of all paragraph texts in the technical background text may be treated as possibly non-innovative paragraphs of the technical background text and retrieved for expansion; this is not specifically limited in the embodiment of the present invention.
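The branching of steps S51 to S53 can be sketched as follows. This is an illustration only: the `non_innovative_flag` field and the function name `find_non_innovative` are hypothetical, and plain substring matching against a list of technical-effect phrases stands in for the semantic analysis, which the embodiment does not spell out at this level.

```python
def find_non_innovative(paragraphs, effect_phrases):
    """Return the paragraph texts that may not be innovative.

    paragraphs: list of dicts with a "text" key and an optional
                "non_innovative_flag" key (hypothetical field name).
    effect_phrases: technical-effect phrases from a preset corpus.
    """
    # S51/S52: if any paragraph carries the identifier, use the marked ones.
    flagged = [p["text"] for p in paragraphs if p.get("non_innovative_flag")]
    if flagged:
        return flagged
    # S53 (simplified): a paragraph containing no technical-effect phrase
    # is treated as possibly non-innovative.
    return [
        p["text"]
        for p in paragraphs
        if not any(phrase in p["text"] for phrase in effect_phrases)
    ]

unmarked = [
    {"text": "A power supply feeds the controller."},
    {"text": "The timer cuts standby power, thereby improving efficiency."},
]
candidates = find_non_innovative(unmarked, ["thereby improving"])
print(candidates)
```

As noted above, the two sources of candidates need not be mutually exclusive; a variant that returns the union of the marked and the analyzed paragraphs would also be consistent with the embodiment.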

In this embodiment, the paragraph texts that may not be innovative are first screened and analyzed, and retrieval is then performed on the screening result, which makes the retrieval more accurate and efficient and further expands the content of the embodiment text.

Based on the same inventive concept, an embodiment of the present invention further provides a technical background text extension apparatus, which, as shown in fig. 6, may include a determining module 61, a retrieval module 62 and an adding module 63 that work according to the following principles:

The determining module 61 determines at least one paragraph text in the technical background text that may not be innovative.

The retrieval module 62 takes at least one retrieval sentence contained in the at least one possibly non-innovative paragraph text as a retrieval object, performs retrieval in a pre-established retrieval database, and determines the similarity value between each retrieved similar sentence and/or similar paragraph text and the retrieval sentence.

The adding module 63 compares the similarity value with a preset similarity threshold, and adds the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold to the technical background text as reference text similar to the paragraph text that may not be innovative.

In an optional embodiment, the retrieval module 62 is further configured to vectorize the at least one retrieval sentence to obtain a retrieval sentence vector; the retrieval module 62 then takes the retrieval sentence vector as the retrieval object, performs retrieval in the pre-established retrieval database to obtain similar sentences and/or similar paragraph texts, and determines the similarity value between each similar sentence and/or similar paragraph text and the retrieval sentence from the calculated similarity distance between them.

In an optional embodiment, the retrieval process specifically includes: the retrieval module 62 takes the retrieval sentence vector as the input retrieval object, and determines the entry of the retrieval sentence vector in the retrieval database according to a preset indexing mode; the retrieval module 62 calculates the similarity distances between the retrieval sentence vector and all sentence vectors in that entry and in the adjacent entries; the retrieval module 62 sorts the obtained similarity distances in ascending order, and takes from the sorted result the similarity distances of a preset number of similar sentences with the smallest distances; the retrieval module 62 converts the similarity distance between each similar sentence and/or similar paragraph text and the retrieval sentence into a similarity value.
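The sort, truncate and convert sequence described above, followed by the threshold comparison performed by the adding module 63, can be sketched as below. The conversion from distance to similarity is not given in this passage, so the mapping 1 / (1 + d) is an assumption chosen purely for illustration, and the function name `top_similar` is hypothetical.

```python
def top_similar(distances, k, threshold):
    """Keep the k nearest sentences and return those above the threshold.

    distances: mapping of sentence id -> similarity distance to the query.
    """
    # Sort in ascending order of distance and keep the k smallest.
    nearest = sorted(distances.items(), key=lambda kv: kv[1])[:k]
    # Assumed conversion: distance 0 maps to similarity 1, larger
    # distances map towards 0. The embodiment may use another formula.
    scored = [(sid, 1.0 / (1.0 + d)) for sid, d in nearest]
    # Only results above the similarity threshold become reference text.
    return [(sid, s) for sid, s in scored if s > threshold]

distances = {"s1": 0.0, "s2": 1.0, "s3": 3.0}
references = top_similar(distances, k=2, threshold=0.6)
print(references)
```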

In a more specific embodiment, the vectorizing of the at least one retrieval sentence by the retrieval module 62 to obtain the retrieval sentence vector may include: the retrieval module 62 segments the retrieval sentence into words according to a preset word segmentation method, and vectorizes the words to obtain word vectors; the retrieval module 62 then computes a weighted sum of the word vectors, weighting each word by its term frequency in the technical background text and by its inverse document frequency, to obtain the retrieval sentence vector.
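A minimal sketch of this weighted summation is given below. It assumes the sentence has already been segmented, that word vectors are available as a plain dictionary, and that the inverse document frequency is computed as log(N / df); the exact weighting formula and the function name `sentence_vector` are assumptions, not details taken from the embodiment.

```python
import math
from collections import Counter

def sentence_vector(tokens, word_vectors, doc_tokens, n_docs, doc_freq):
    """Weighted sum of word vectors, each weighted by TF * IDF.

    tokens:       segmented words of the retrieval sentence
    word_vectors: word -> list of floats (e.g. from a word-embedding model)
    doc_tokens:   all segmented words of the technical background text
    n_docs:       number of documents in the reference collection
    doc_freq:     word -> number of documents containing the word
    """
    tf = Counter(doc_tokens)
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for tok in tokens:
        if tok not in word_vectors:
            continue  # out-of-vocabulary words are skipped in this sketch
        term_freq = tf[tok] / len(doc_tokens)
        inv_doc_freq = math.log(n_docs / doc_freq.get(tok, 1))
        weight = term_freq * inv_doc_freq
        vec = [acc + weight * x for acc, x in zip(vec, word_vectors[tok])]
    return vec

word_vectors = {"power": [1.0, 0.0], "timing": [0.0, 1.0]}
doc_tokens = ["power", "timing", "power", "supply"]
vec = sentence_vector(["power", "timing"], word_vectors, doc_tokens,
                      n_docs=100, doc_freq={"power": 10, "timing": 1})
print(vec)
```

In the toy data, the frequent but common word "power" and the rarer word "timing" end up with comparable weights, which is the balancing effect the TF and IDF factors are meant to provide.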

In an alternative embodiment, the adding module 63 adds the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold to the technical background text as reference text similar to the possibly non-innovative paragraph text, associated with that paragraph text in any one or more of the following modes: comment mode, labeling mode, revision mode and annotation mode.

In an alternative embodiment, the determining module 61 determines at least one paragraph text in the technical background text that may not be innovative as follows:

the determining module 61 judges whether any paragraph text in the technical background text is marked with a non-innovative identifier;

when such an identifier is present, the determining module 61 determines the paragraph texts marked with the non-innovative identifier as the paragraph texts that may not be innovative;

when no such identifier is present, the determining module 61 performs semantic analysis on all paragraph texts in the technical background text, and determines the paragraph texts that may not be innovative according to the analysis result.

More specifically, the determining module 61 determines, from the analysis result, the paragraph texts that may not be innovative in any of the following ways:

the determining module 61 compares each paragraph text in the technical background text with the background-art text in the technical background text, and, if a paragraph text does not contain any technical-effect sentence text from the preset corpus, determines that paragraph text as a paragraph text that may not be innovative; or

the determining module 61 searches all paragraph texts in the technical background text for paragraph texts containing technical-effect sentence text, and, if a paragraph text does not contain any technical-effect sentence text from the preset corpus, determines that paragraph text as a paragraph text that may not be innovative; or

the determining module 61 compares the paragraph texts in the technical background text with the paragraph texts in a preset database, determines the similarity between them, and determines the paragraph texts whose similarity is higher than a preset threshold as the paragraph texts that may not be innovative.

For specific description, beneficial effects and relevant examples of the device according to the embodiment of the present disclosure, reference is made to the above method portion, and details are not repeated herein.

Based on the same inventive concept, the embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when the instructions are executed by a processor, the above technical background text extension method can be implemented.

Based on the same inventive concept, the embodiment of the present invention further provides a server, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the server can be used to implement the above-mentioned technical background text extension method.

For specific descriptions, beneficial effects and related examples of the computer-readable storage medium and the server according to the embodiments of the present disclosure, reference is made to the above method portions, and details are not repeated herein.

Example 2

Embodiment 2 of the present invention provides another technical background text extension method, and as shown in fig. 7, the method may include the following steps:

Step S71, determining at least one paragraph text in the technical background text that may not be innovative.

For detailed description and specific implementation of this step, reference is made to the content of step S11 in embodiment 1, and details are not repeated here.

Step S72, searching the keywords contained in the at least one paragraph text that may not be innovative according to a preset search mode to determine the core keywords.

In this step, the keywords are first extracted from the paragraph text that may not be innovative: the formatting of the paragraph text is removed and a plurality of keywords are obtained, for example by using a Bag of Words (BoW for short) model. The extracted keywords are then searched according to the preset search mode to determine the core keywords, for example by taking the keywords with the highest occurrence frequency, or the keywords whose occurrence frequency accounts for more than a preset proportion (for example, 15%) of the occurrence frequencies of all keywords. In one specific example, suppose the determined core keywords are power supply, energy saving and timing. It should be noted that the above paragraph text normally contains not just one keyword but a plurality of keywords.
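The proportion-based selection of core keywords can be sketched as follows. The keyword list is assumed to have been extracted already (for example by a bag-of-words pass), the function name `core_keywords` is hypothetical, and the 15% default simply mirrors the example proportion mentioned above.

```python
from collections import Counter

def core_keywords(keywords, min_share=0.15):
    """Return keywords whose share of all keyword occurrences exceeds min_share."""
    counts = Counter(keywords)
    total = sum(counts.values())
    # Counter keeps first-seen order, so the result follows the text order.
    return [kw for kw, n in counts.items() if n / total > min_share]

# Toy bag of words for one paragraph: 12 keyword occurrences in total.
extracted = (["power supply"] * 4 + ["energy saving"] * 3 +
             ["timing"] * 3 + ["housing"] + ["screw"])
core = core_keywords(extracted)
print(core)
```

Here "housing" and "screw" each account for one occurrence in twelve (about 8%) and are therefore not treated as core keywords.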

Step S73, comparing the correlation value between the core keyword and each of the other keywords with a preset correlation threshold, and determining the other keywords whose correlation value is greater than the preset correlation threshold as expanded keywords; or sorting the other keywords in descending order of their correlation with the core keyword, and determining a preset number of top-ranked other keywords in the sorted result as the expanded keywords.

In this step, all extracted keywords (including the core keywords) are first vectorized, for example by word embedding (a vector representation of word meaning); for instance, "power supply" may be represented by the multidimensional vector (0.2, 0.5, 0.3, 0.4, 0.3, ……, 0.1).

Secondly, the correlation between each core keyword and all other keywords is calculated. In the embodiment of the invention, the cosine of the angle between the multidimensional vectors is used as the correlation between each core keyword and the other keywords.

Then, the obtained correlation value is compared with the preset correlation threshold, and the other keywords whose correlation value is greater than the threshold are taken as expanded keywords; alternatively, the other keywords are sorted in descending order of their correlation with the core keyword, and a preset number of top-ranked other keywords are taken. For example, a word count N is set, and the top N other keywords in the ranking are determined as the expanded keywords. For example, with the correlation threshold set to 0.6, the result is as shown in table 2:

TABLE 2
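The two selection rules of step S73 (threshold or top N) can be sketched with a plain cosine computation. The vectors below are two-dimensional toy values chosen so that the cosines are easy to verify by hand; they are not taken from table 2, and the function names `cosine` and `expand_keywords` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def expand_keywords(core_vec, others, threshold=None, top_n=None):
    """Rank other keywords by correlation with the core keyword.

    With threshold set, keep those whose correlation exceeds it;
    otherwise keep the top_n highest-ranked keywords.
    """
    ranked = sorted(
        ((kw, cosine(core_vec, vec)) for kw, vec in others.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    if threshold is not None:
        return [(kw, score) for kw, score in ranked if score > threshold]
    return ranked[:top_n]

core_vec = [1.0, 0.0]                      # toy vector for a core keyword
others = {
    "transformer": [4.0, 3.0],             # cosine 0.8
    "housing": [0.0, 1.0],                 # cosine 0.0
    "adapter": [1.0, 0.0],                 # cosine 1.0
}
by_threshold = expand_keywords(core_vec, others, threshold=0.6)
by_rank = expand_keywords(core_vec, others, top_n=1)
print(by_threshold, by_rank)
```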

Step S74, constructing a target retrieval feature sequence of the paragraph text that may not be innovative according to the core keywords and the expanded keywords.

The expanded keywords obtained in step S73 and the core keywords together form a high-dimensional keyword set, and all combinations of the keywords in this set form the target retrieval feature sequences of the paragraph text.

For example, the target search feature sequence formed by combining all the high-dimensional key phrases in table 2 above may be:

power supply, energy saving, timing (1, 1, 1);

power supply, energy saving, on time (1, 1, 0.7);

power supply, energy saving, time-keeping (1, 1, 0.6);

power supply, power saving, timing (1, 0.9, 1);

……

transformer, low energy consumption, time-keeping (0.6, 0.6, 0.6).
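Forming all combinations of the high-dimensional keyword set amounts to a Cartesian product over the candidate lists of the core keywords. The sketch below illustrates this with `itertools.product`; the candidate lists only contain the words and correlation values that appear in the examples above, not the full content of table 2, and the function name `feature_sequences` is hypothetical.

```python
from itertools import product

def feature_sequences(groups):
    """Build every combination of one candidate per core keyword.

    groups: one list per core keyword, each holding (keyword, correlation)
            pairs with the core keyword itself first at correlation 1.0.
    Returns a list of (keywords, correlations) tuples.
    """
    sequences = []
    for combo in product(*groups):
        words = tuple(kw for kw, _ in combo)
        weights = tuple(corr for _, corr in combo)
        sequences.append((words, weights))
    return sequences

# Candidates reconstructed from the listed examples only (assumption).
groups = [
    [("power supply", 1.0), ("transformer", 0.6)],
    [("energy saving", 1.0), ("power saving", 0.9),
     ("low energy consumption", 0.6)],
    [("timing", 1.0), ("on time", 0.7), ("time-keeping", 0.6)],
]
sequences = feature_sequences(groups)
print(len(sequences), sequences[0], sequences[-1])
```

Since the number of sequences is the product of the candidate-list lengths, the correlation threshold or the word count N chosen in step S73 directly controls how many retrieval objects are generated.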

Step S75, taking the target retrieval feature sequence as a retrieval object, performing retrieval in a pre-established relational database, and calculating the similarity value between the target retrieval feature sequence and the retrieval feature sequences in the pre-established relational database.

It is understood that step S74 generally yields more than one target retrieval feature sequence; all of them are used as retrieval objects, and the similarity value between each target retrieval feature sequence and the retrieval feature sequences in the pre-established relational database is calculated. Of course, during the calculation, iterative training may also be performed on the core keywords, the expanded keywords and their correlations in order to find the closest similar text.

As with the retrieval database in embodiment 1, the relational database pre-established in this embodiment may store, according to a structure preset in the database, the original paragraph text (or sentence text), the full-text number of the original paragraph text, the high-dimensional keyword set of the paragraph text, and the retrieval feature sequence corresponding to that keyword set.

For the similarity calculation, reference may be made to the detailed description of steps S31 to S34 in embodiment 1, which is not repeated here.

Step S76, comparing the similarity value with a preset similarity threshold, and adding the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold to the technical background text as reference text similar to the paragraph text that may not be innovative.

For the specific implementation of this step, reference is made to the description of step S13 in embodiment 1, which is not repeated here.

In this embodiment, the core keywords and the expanded keywords are determined from the keywords in at least one paragraph text of the technical background text that may not be innovative, and a target retrieval feature sequence is constructed for that paragraph text. Retrieval and analysis are then performed, the similarity values between the retrieved similar sentences and/or similar paragraph texts and the retrieval object are determined, and the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold are added to the technical background text as reference text similar to the paragraph text that may not be innovative. Compared with the prior-art approach of expanding the embodiment text by manual writing, the method improves retrieval efficiency and quality, saves a large amount of manpower and material resources, assists the writing work of the personnel involved, and thereby improves writing quality and efficiency.

Based on the same inventive concept, an embodiment of the present invention further provides another technical background text extension apparatus, which, as shown in fig. 8, may include a first determining module 81, a second determining module 82, a third determining module 83, a constructing module 84, a retrieval module 85 and an adding module 86 that work according to the following principles:

the first determining module 81 determines at least one paragraph text in the technical background text that may not be innovative;

the second determining module 82 searches the keywords contained in the at least one paragraph text that may not be innovative according to a preset search mode to determine the core keywords;

the third determining module 83 compares the correlation value between the core keyword and each of the other keywords with a preset correlation threshold, and determines the other keywords whose correlation value is greater than the preset correlation threshold as the expanded keywords;

the constructing module 84 constructs a target retrieval feature sequence of the paragraph text that may not be innovative according to the core keywords and the expanded keywords;

the retrieval module 85 takes the target retrieval feature sequence as a retrieval object, performs retrieval in a pre-established relational database, and calculates the similarity value between the target retrieval feature sequence and the retrieval feature sequences in the pre-established relational database;

the adding module 86 compares the similarity value with a preset similarity threshold, and adds the similar paragraph texts and/or similar sentences whose similarity value is higher than the similarity threshold to the technical background text as reference text similar to the paragraph text that may not be innovative.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for expanding a technical background text, comprising:

2. The method of claim 1, wherein:

taking at least one retrieval statement contained in the paragraph text which may not be innovative as a retrieval object, and performing retrieval in a pre-established retrieval database, wherein the method further comprises the following steps:

3. The method according to claim 2, wherein the searching in a pre-established search database with the search statement vector as a search object to determine the similarity value between the similar statement and/or similar paragraph text and the search statement comprises:

4. The method of claim 2, wherein vectorizing at least one of the search sentences to obtain a search sentence vector comprises:

5. The method of claim 1, wherein adding similar paragraph text and/or similar sentences with similarity values higher than a similarity threshold to the technical background text as reference text similar to the paragraph text which may not be innovative comprises:

comment mode, labeling mode, revision mode and annotation mode.

6. The method as claimed in any one of claims 1 to 5, wherein the determining at least one paragraph text that may not be innovative in the technical background text comprises:

7. The method as claimed in claim 6, wherein the semantic analysis of all the paragraph texts in the technical background text and the determination of the paragraph texts which may not be innovative by the analysis result comprise:

8. A method for expanding a technical background text, comprising:

9. A device for extending technical background text, comprising:

10. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the method for expanding a technical background text as claimed in any one of claims 1 to 8.

11. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to implement the method of extending technical background text as claimed in any one of claims 1 to 8 when executing the program.