WO2022134575A1 - Method, apparatus, device and storage medium for extracting business keywords - Google Patents

Method, apparatus, device and storage medium for extracting business keywords

Info

Publication number
WO2022134575A1
Authority
WO
WIPO (PCT)
Prior art keywords
business
target
word
probability value
text
Prior art date
Application number
PCT/CN2021/109145
Other languages
English (en)
French (fr)
Inventor
赵焕丽
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134575A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application relates to the field of natural language processing of artificial intelligence, and in particular, to a method, apparatus, device and storage medium for extracting business keywords.
  • NER: named entity recognition
  • Named entity recognition refers to the recognition of entities with specific meanings in the text, such as person names and place names, while the named entity recognition task
  • the model uses a lexical character-based method to segment the recognized text and then extract its keywords.
  • the present application provides a method, apparatus, device and storage medium for extracting business keywords, which are used to improve the accuracy of extracting business keywords.
  • a first aspect of the present application provides a method for extracting business keywords, including:
  • word vector conversion is performed on the word segmentation information to obtain a target word segmentation vector, where the preset neural network model includes an embedding layer and a feature extraction layer;
  • semantic feature extraction and context semantic encoding are performed on the target word segmentation vector to obtain target semantic encoding features;
  • classification and keyword extraction are performed in sequence on the to-be-processed text information to obtain target business keywords.
  • a second aspect of the present application provides a device for extracting business keywords, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • word vector conversion is performed on the word segmentation information to obtain a target word segmentation vector, where the preset neural network model includes an embedding layer and a feature extraction layer;
  • semantic feature extraction and context semantic encoding are performed on the target word segmentation vector to obtain target semantic encoding features;
  • classification and keyword extraction are performed in sequence on the to-be-processed text information to obtain target business keywords.
  • a third aspect of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed on a computer, the computer is caused to perform the following steps:
  • word vector conversion is performed on the word segmentation information to obtain a target word segmentation vector, where the preset neural network model includes an embedding layer and a feature extraction layer;
  • semantic feature extraction and context semantic encoding are performed on the target word segmentation vector to obtain target semantic encoding features;
  • classification and keyword extraction are performed in sequence on the to-be-processed text information to obtain target business keywords.
  • a fourth aspect of the present application provides a device for extracting business keywords, including:
  • a matching module configured to acquire the text information to be processed and to perform business vocabulary matching on it using a preset business dictionary tree to obtain the text business vocabulary;
  • a word segmentation module configured to perform word segmentation processing on the to-be-processed text information according to the text business vocabulary to obtain word segmentation information;
  • a conversion module configured to perform word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer;
  • an encoding module configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain target semantic encoding features;
  • an extraction module configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding features to obtain target business keywords.
  • the text information to be processed is obtained, and business vocabulary matching is performed on it using the preset business dictionary tree to obtain the text business vocabulary; word segmentation processing is performed on the text information to be processed according to the text business vocabulary to obtain the word segmentation information; the word segmentation information is converted into word vectors through the embedding layer in the preset neural network model to obtain the target word segmentation vector, where the preset neural network model includes the embedding layer and the feature extraction layer.
  • through the feature extraction layer, semantic feature extraction and context semantic encoding are performed on the target word segmentation vector to obtain target semantic encoding features; according to the target semantic encoding features, classification and keyword extraction are performed in sequence on the text information to be processed to obtain target business keywords.
  • in the embodiments of the present application, business vocabulary matching with the preset business dictionary tree, word segmentation according to the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding features are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the to-be-processed text information, thereby improving the extraction accuracy of business keywords.
  • FIG. 1 is a schematic diagram of an embodiment of a method for extracting business keywords in an embodiment of the application
  • FIG. 2 is a schematic diagram of another embodiment of the method for extracting business keywords in the embodiment of the present application.
  • FIG. 3 is a schematic diagram of an embodiment of an apparatus for extracting business keywords in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of an apparatus for extracting business keywords in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of a device for extracting business keywords in an embodiment of the present application.
  • the embodiments of the present application provide a method, apparatus, device and storage medium for extracting business keywords, which improve the accuracy of extracting business keywords.
  • an embodiment of the method for extracting business keywords in the embodiment of the present application includes:
  • the execution subject of the present application may be an apparatus for extracting business keywords, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • the server calls a preset voice collector to collect the voice information input by the user, performs voice recognition and text conversion on the voice information through a preset voice recognition model to obtain the recognized text, and detects whether data is missing from the recognized text. If so, it fills in the missing values to obtain the processed recognized text and performs a security check on the processed recognized text to obtain the text information to be processed. If not, it performs the security check directly on the recognized text to obtain the text information to be processed.
  • the server may receive the text information input from the preset interface, thereby obtaining the text information to be processed.
  • the server can create a target key for the text information to be processed, traverse the preset business dictionary tree with the target key, and obtain the corresponding text business vocabulary from the preset business dictionary tree by matching; the server can also use a preset lowest common ancestor (LCA) algorithm and a doubling (binary lifting) algorithm to match the text business vocabulary corresponding to the text information to be processed from the preset business dictionary tree.
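The dictionary-tree lookup described above can be illustrated with a minimal sketch. It assumes the business dictionary tree is an ordinary character trie and that the longest dictionary entry wins at each position; the names `TrieNode`, `build_trie` and `match_business_words` are hypothetical and do not come from the application, and the LCA/doubling variant is not shown.

```python
class TrieNode:
    """One node of a character trie; `is_word` marks the end of a dictionary entry."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every business word into a trie, one character per level."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def match_business_words(text, root):
    """Return (start, end, word) for the longest dictionary word found at each position."""
    matches = []
    i = 0
    while i < len(text):
        node = root
        longest_end = -1
        j = i
        # Walk down the trie for as long as the text keeps matching.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest_end = j
        if longest_end != -1:
            matches.append((i, longest_end, text[i:longest_end]))
            i = longest_end  # continue after the matched word
        else:
            i += 1
    return matches
```

For instance, with the hypothetical dictionary `["renhe insurance", "policy loan"]`, the text `"my renhe insurance policy"` yields a single match for `"renhe insurance"`, because `"policy loan"` is only partially present.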
  • the server performs token replacement on the characters in the text information to be processed that correspond to the text business vocabulary to obtain the initial text information, and then segments the characters in the initial text information to obtain the word segmentation information. For example, take the text information to be processed "I have a Renhe Insurance under my name, and I have paid for it for 3 years": the text business word is "Renhe Insurance", and replacing the text business word "Renhe Insurance" with a token yields the initial text information.
  • the remaining characters are divided into single characters, and the word segmentation information "I/name/under/you/individual/Renhe Insurance/,/paid/lai/3/year/date" is obtained, in which the business word "Renhe Insurance" is kept as a single token.
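The two-phase segmentation above (token replacement, then splitting into single characters) might be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the function name `segment` is hypothetical, placeholders are drawn from the Unicode private-use area (assumed not to occur in the input), and whitespace is dropped.

```python
def segment(text, business_words):
    """Keep each business word as one token and split everything else into single characters."""
    # Longest words first, so a long business word is not broken up by one of its substrings.
    words = sorted({w for w in business_words if w}, key=len, reverse=True)

    # Phase 1: token replacement. Each business word becomes one placeholder character.
    placeholder_to_word = {}
    initial_text = text
    for idx, word in enumerate(words):
        placeholder = chr(0xE000 + idx)  # private-use code point (assumption: absent from input)
        placeholder_to_word[placeholder] = word
        initial_text = initial_text.replace(word, placeholder)

    # Phase 2: split into single characters and restore the business words.
    return [placeholder_to_word.get(ch, ch) for ch in initial_text if not ch.isspace()]
```

Calling `segment("abXYZc", ["XY", "XYZ"])` keeps `"XYZ"` together and splits the rest character by character.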
  • after the server divides the characters in the initial text information into single characters, it can perform grammar detection and sensitive-word judgment on the segmented words, and determine the segmented words that conform to the grammar and are not sensitive words as the word segmentation information;
  • it can also perform part-of-speech filtering on the segmented words according to a part-of-speech filtering rule to obtain the word segmentation information.
  • the preset neural network model includes an embedding layer and a feature extraction layer, and the embedding layer uses pre-trained word vectors.
  • the server obtains the business word vector corresponding to the text business word and determines whether the pre-trained word vectors in the embedding layer contain a word vector consistent with the business word vector. If so, it maps the word segmentation information to the preset dimension space corresponding to the business word vector to obtain the target word segmentation vector; if not, it maps the word segmentation information to the dimension space corresponding to the preset word vector to obtain the target word segmentation vector. For example: if the pre-trained word vectors in the embedding layer contain the business word vector for "Renhe Insurance", the word segmentation information is mapped to the preset dimension space corresponding to the business word vector, so as to obtain the target word segmentation vector for "Renhe Insurance"; if the pre-trained word vectors in the embedding layer do not contain the business word vector for "Renhe Insurance", the word segment
  • the server may also map the word segmentation information to a preset dimension space through the word vectors pre-trained in the embedding layer, so as to obtain word vectors, which include character-level vectors and word-level vectors; it then obtains the text business word vector corresponding to the text business word in the word segmentation information, calculates the cosine distance value between the text business word vector and each word vector, and determines whether each cosine distance value is greater than the preset target value.
  • if so, it is determined that the word vectors contain a vector corresponding to the text business word vector, and the word vectors are determined as the target word segmentation vector. If not, it is determined that the word vectors contain no vector corresponding to the text business word vector, and the vector at the position of the text business word in the word vectors is replaced with the text business word vector, so as to obtain the target word segmentation vector.
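The comparison-and-replacement step above can be sketched in a few lines. The text speaks of a "cosine distance value" but treats a larger value as a match, so this sketch assumes cosine similarity is meant; `cosine_similarity`, `align_business_vector` and their parameters are hypothetical names, and vectors are plain Python lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_business_vector(token_vectors, position, business_vector, target_value):
    """Keep the token vectors if one of them already matches the business word vector;
    otherwise overwrite the vector at the business word's position."""
    similarities = [cosine_similarity(business_vector, vec) for vec in token_vectors]
    if any(s > target_value for s in similarities):
        return list(token_vectors)
    result = list(token_vectors)
    result[position] = business_vector
    return result
```

With token vectors `[[1, 0], [0, 1]]` and business vector `[1, 1]`, every similarity is about 0.707, so a target value of 0.9 triggers the replacement while a target value of 0.5 leaves the vectors unchanged.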
  • the feature extraction layer may be a bi-directional long short-term memory network (BiLSTM), a convolutional neural network (CNN), and/or a Transformer, etc.
  • the feature extraction layer is general-purpose, and its network structure is not limited.
  • the initial semantic features are obtained; dimensionality reduction is performed on the initial semantic features to obtain candidate semantic features; the context vectors of the candidate semantic features are extracted; and the candidate semantic features are encoded based on the context vectors through a preset semantic coding model to obtain the target semantic encoding features.
  • the server uses the classifier in the preset neural network model to classify the target semantic encoding features by business vocabulary, obtains multiple classification values, and sorts the multiple classification values in descending order;
  • the word in the text information to be processed that corresponds to the first-ranked classification value is determined as the target word;
  • the target word is extracted from the text information to be processed;
  • the target words are combined in the order in which they appear in the text information to be processed to obtain the target business keyword.
  • the model can include one or more classifiers. If the output layer includes multiple classifiers, the classifiers can be connected in parallel, that is, they share the same input; they can also be connected according to a preset connection method, that is, the input of the next classifier can be the output of the previous classifier.
  • in the embodiments of the present application, business vocabulary matching with the preset business dictionary tree, word segmentation according to the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding features are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the to-be-processed text information, thereby improving the extraction accuracy of business keywords.
  • referring to FIG. 2, another embodiment of the method for extracting business keywords in the embodiment of the present application includes:
  • the execution process of step 201 is similar to that of step 101 above, and details are not repeated here.
  • before the server obtains the text information to be processed and performs business vocabulary matching on it using the preset business dictionary tree to obtain the text business vocabulary, it obtains a business vocabulary set and calculates the term frequency-inverse document frequency value of each business word in the set; sorts the business vocabulary set by that value to obtain a business vocabulary sequence; performs string segmentation on the business vocabulary sequence to obtain a segmented character set, and creates inverted index information for the business vocabulary sequence; and, taking the inverted index information as the root node and the segmented character set as leaf nodes, creates the preset business dictionary tree from the root node and the leaf nodes.
  • the server crawls a business domain vocabulary set from web pages, extracts a business vocabulary list from a preset database, and merges and de-duplicates the business domain vocabulary and the business vocabulary in the list to obtain the business vocabulary set.
  • TF-IDF: term frequency-inverse document frequency
  • the server classifies each business word in the business vocabulary sequence by its first character to obtain the classified business vocabulary, and performs string segmentation on the classified business vocabulary to obtain the segmented character set. It creates the inverted index information of the business vocabulary sequence through a preset inverted index algorithm.
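The construction steps above (TF-IDF scoring, sorting, grouping by first character, and tree creation) can be sketched as below. Several details are assumptions made only for illustration: term frequency is counted as substring occurrences per character of corpus text, the IDF is smoothed, the inverted index maps a word's first character to the words that start with it, and the tree is a nested dictionary with a `"$end"` marker. None of these choices or names (`tf_idf_scores`, `build_dictionary_tree`) is taken from the application.

```python
import math
from collections import defaultdict


def tf_idf_scores(vocabulary, documents):
    """Score each business word by a simple, smoothed TF-IDF over a list of document strings."""
    n_docs = len(documents)
    total_chars = sum(len(doc) for doc in documents) or 1
    scores = {}
    for word in vocabulary:
        if not word:
            continue  # skip empty entries
        occurrences = sum(doc.count(word) for doc in documents)
        doc_freq = sum(1 for doc in documents if word in doc)
        tf = occurrences / total_chars
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothing is an assumption
        scores[word] = tf * idf
    return scores


def build_dictionary_tree(vocabulary, documents):
    """Return (sorted word sequence, first-character inverted index, nested-dict tree)."""
    scores = tf_idf_scores(vocabulary, documents)
    # Highest score first; ties broken alphabetically for a deterministic order.
    sequence = sorted(scores, key=lambda w: (-scores[w], w))
    inverted_index = defaultdict(list)
    tree = {}
    for word in sequence:
        inverted_index[word[0]].append(word)
        node = tree
        for ch in word:  # string segmentation: one tree level per character
            node = node.setdefault(ch, {})
        node["$end"] = word  # marks a complete business word
    return sequence, dict(inverted_index), tree
```

With the hypothetical vocabulary `["car insurance", "car loan"]` and three short documents, both words land under the index entry `"c"` and share the path `c → a → r` in the tree.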
  • the execution process of step 202 is similar to that of step 102 above, and details are not repeated here.
  • the server converts the word segmentation information into word vectors using the pre-trained word vectors of the embedding layer in the preset neural network model to obtain the text word vector; it obtains the target word vector of the text business vocabulary and judges whether the text word vector contains the target word vector; if the text word vector contains the target word vector, the corresponding word vectors in the text word vector are spliced according to the target word vector to obtain the target word segmentation vector.
  • the embedding layer in the preset neural network model uses pre-trained word vectors, that is, word vectors that have already been pre-trained by a third party and cover millions of words.
  • word vector conversion is performed on the word segmentation information to obtain the text word vector.
  • the target word vector of the text business vocabulary is the word vector corresponding to "Renhe Insurance".
  • the server judges whether the text word vector contains the word vector corresponding to "Renhe Insurance" (that is, the target word vector). If so (that is, the text word vector is the word vector corresponding to "I/name/xia/you/person/Renhe Insurance/"), then, according to the word vector corresponding to "Renhe Insurance", the word vectors corresponding to "people/he/insurance/" are spliced to obtain the target word segmentation vector "I/name/xia/you/person/people/people/insurance/". If not (that is, the text word vector is the word vector corresponding to "I/name/xia/you/person/person/and/insurance/"), the text word vector is determined as the target word segmentation vector.
  • the server may acquire a business vocabulary corpus in advance, obtain the representation information and context features of the business vocabulary corpus, and use a preset Skip-Gram model to perform word vector training based on the business vocabulary corpus and its representation information and context features, so as to obtain the pre-trained word vectors.
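As a small illustration of the Skip-Gram idea mentioned above, the sketch below only generates the (center word, context word) training pairs that a Skip-Gram model would learn from; the training itself is not shown. The function name `skip_gram_pairs` and the `window` parameter are assumptions made for this example.

```python
def skip_gram_pairs(tokens, window=2):
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

For the token list `["a", "b", "c"]` and a window of 1, the middle token `"b"` is paired with both neighbours and the edge tokens with one neighbour each.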
  • the server extracts the context vector and the semantic features of the target word segmentation vector through the feature extraction layer, and performs matrix multiplication on the semantic features to obtain the initial semantic encoding features; it then encodes the initial semantic encoding features according to the context vector to obtain the target semantic encoding features.
  • the server extracts the context vector of the target word segmentation vector through the bidirectional long short-term memory network (BiLSTM) layer in the feature extraction layer, and performs syntactic analysis and semantic classification on the target word segmentation vector through the convolutional neural network in the feature extraction layer to obtain the semantic features; it performs matrix transformation on the semantic features to obtain semantic matrices, and multiplies the semantic matrices together to obtain the initial semantic encoding features, which removes redundant semantic features.
  • the context vectors are fused from top to bottom to obtain the initial vector, and the server encodes the initial semantic encoding features according to the initial vector through the Transformer model in the feature extraction layer to obtain the target semantic encoding features.
  • the server performs business vocabulary classification on the target semantic encoding features through the output layer in the preset neural network model to obtain the initial classification probability values; sorts the initial classification probability values in descending order, determines the first-ranked initial classification probability value as the candidate classification probability value, and judges whether the candidate classification probability value is greater than the preset threshold; if the candidate classification probability value is greater than the preset threshold, it is determined as the target classification probability value; if the candidate classification probability value is less than or equal to the preset threshold, the to-be-processed classification probability value of the to-be-processed text information is re-acquired, and the to-be-processed classification probability value is compared with the preset threshold to obtain the target classification probability value.
  • for example, through the output layer in the preset neural network model, the server classifies the target semantic encoding features by business vocabulary to obtain the initial classification probability values A1, A2 and A3, and sorts A1, A2 and A3 in descending order of value.
  • the first-ranked value is taken as the candidate classification probability value B; if B is greater than the preset threshold, B is determined as the target classification probability value;
  • if B is less than or equal to the preset threshold, whichever of B and A2 is larger can be determined as the target classification probability value; it is also possible to iteratively re-acquire the to-be-processed semantic encoding features of the to-be-processed text information and obtain their to-be-processed classification probability value C, until C is greater than the preset threshold, thereby obtaining the target classification probability value.
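The threshold screening described above can be sketched as a small selection routine. The names `select_target_probability`, `reacquire` and `max_rounds` are hypothetical: `reacquire` stands in for whatever step produces a new to-be-processed classification probability value, and the bounded number of rounds with a fall-back to the largest value seen is an assumption added so that the loop always terminates.

```python
def select_target_probability(initial_probs, threshold, reacquire=None, max_rounds=3):
    """Pick the target classification probability value.

    The largest initial value is the candidate. If it exceeds the threshold it is
    returned; otherwise new values are requested from `reacquire` until one
    exceeds the threshold or `max_rounds` is reached, after which the largest
    value seen so far is returned.
    """
    if not initial_probs:
        raise ValueError("initial_probs must not be empty")
    candidate = sorted(initial_probs, reverse=True)[0]
    if candidate > threshold:
        return candidate
    best = candidate
    if reacquire is not None:
        for _ in range(max_rounds):
            pending = reacquire()
            if pending > threshold:
                return pending
            best = max(best, pending)
    return best
```

A caller would pass the output-layer probabilities as `initial_probs` and, optionally, a zero-argument callable that re-runs classification and returns a new probability.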
  • the preset neural network model also includes an output layer, which is used for classifying business words and selecting probability values for the target semantic coding features.
  • the output layer may also include one or more classifiers. If the output layer includes multiple classifiers, the classifiers can be connected in parallel, that is, they share the same input; they can also be connected according to a preset connection method, that is, the input of the next classifier can be the output of the previous classifier.
  • the server extracts the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the initial business keywords; and sequentially performs splicing, part-of-speech filtering and dictionary tree matching on the initial business keywords to obtain the target business keywords.
  • the server marks each word in the text information to be processed according to the target classification probability value, and extracts the marked words from the text information to be processed, or deletes the unmarked words from it, so as to obtain the initial business keywords. For example, taking the insurance name as the target business keyword, the server marks each word in the text information to be processed according to the target classification probability value and obtains the mark sequence "insurance name insurance name insurance name insurance name o o"; it then extracts the words marked "insurance name" from the text information to be processed, or deletes the words marked "o" (that is, unmarked) from it, so as to obtain the initial business keywords "hua" and "color".
  • the server splices the initial business keywords according to preset splicing rules to obtain spliced words, filters the spliced words through preset part-of-speech filtering rules to obtain candidate business keywords, and matches the candidate business keywords against the preset business dictionary tree.
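The marking, extraction and splicing steps above can be illustrated with a short sketch. It assumes each token carries one mark, that `"o"` denotes an unmarked token, and that the splicing rule simply joins marked tokens that are adjacent in the text; the function names and the `joiner` parameter are hypothetical, and part-of-speech filtering and dictionary tree matching are not shown.

```python
def extract_marked(tokens, tags, outside_tag="o"):
    """Return (position, token) for every token whose mark is not the outside mark."""
    return [(i, tok) for i, (tok, tag) in enumerate(zip(tokens, tags)) if tag != outside_tag]


def splice_adjacent(marked, joiner=""):
    """Join marked tokens that sit next to each other in the original text."""
    spliced = []
    prev_index = None
    for index, token in marked:
        if prev_index is not None and index == prev_index + 1:
            spliced[-1] += joiner + token  # extend the current keyword
        else:
            spliced.append(token)  # start a new keyword
        prev_index = index
    return spliced
```

For tokens `["A", "B", "x", "C"]` marked `["name", "name", "o", "name"]`, extraction keeps `A`, `B` and `C`, and splicing yields `["AB", "C"]` because only `A` and `B` are adjacent.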
  • the server can also send the text information to be processed and the candidate business keywords to a preset review terminal, where a reviewer or a preset model extracts the business vocabulary from the text information to be processed to obtain the target business keywords, so as to improve the accuracy of the target business keywords.
  • the server obtains the initial historical text information and the initial historical business words corresponding to the initial historical text information from the preset database, matches the target historical text information corresponding to the text information to be processed from the initial historical text information, and obtains the target historical business word corresponding to the target historical text information.
  • the first error value is assigned directly according to the text matching result: if the target historical business word is exactly the same as the target business keyword, the first error value is 0; if the target historical business word and the target business keyword are different, the first error value is 1. The server then obtains the second error value of the target business keyword from manual review, and calculates the sum, or the weighted sum, of the first error value and the second error value to obtain the target error value.
  • the server obtains the corrected business word based on manual correction; if not, it generates a null character as the corrected business word of the target business keyword. The target error value and the corrected business word of the target business keyword are stored in the preset storage space, and the network layers, network structure and model parameters of the preset neural network model are adjusted and optimized according to the target error value and the corrected business word, so as to improve the accuracy of the preset neural network model.
  • the server may also acquire the initial historical text information and the initial historical business words corresponding to the initial historical text information from the preset database, match the target historical text information corresponding to the to-be-processed text information from the initial historical text information, obtain the target historical business word corresponding to the target historical text information, calculate the similarity between the target historical business word and the target business keyword, and take the absolute value of the difference between the similarity and 1 as the first error value of the target business keyword.
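The two ways of deriving the first error value, and its combination with the second error value, can be sketched as follows. The similarity measure is not specified in the text, so this sketch assumes the ratio from Python's `difflib.SequenceMatcher`; the function names and the `weights` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def first_error_exact(historical_word, target_keyword):
    """0 if the historical business word equals the target keyword, otherwise 1."""
    return 0.0 if historical_word == target_keyword else 1.0


def first_error_similarity(historical_word, target_keyword):
    """|similarity - 1|, with string similarity taken from difflib (an assumption)."""
    similarity = SequenceMatcher(None, historical_word, target_keyword).ratio()
    return abs(similarity - 1.0)


def target_error(first_error, second_error, weights=None):
    """Plain sum of the two error values, or a weighted sum when weights are given."""
    if weights is None:
        return first_error + second_error
    w1, w2 = weights
    return w1 * first_error + w2 * second_error
```

For example, a mismatch (first error 1.0) combined with a manual-review error of 0.5 gives 1.5 as a plain sum, or 0.85 with weights (0.7, 0.3).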
  • in this embodiment, in addition to segmenting the to-be-processed text information according to the text business vocabulary, the preset neural network model is optimized according to the target error value and the corrected business words, thereby improving the accuracy of the preset neural network model.
  • an embodiment of the apparatus for extracting business keywords in the embodiments of the present application includes:
  • the matching module 301 is configured to obtain the text information to be processed and to perform business vocabulary matching on it using the preset business dictionary tree to obtain the text business vocabulary;
  • the word segmentation module 302 is configured to perform word segmentation processing on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
  • the conversion module 303 is configured to perform word vector conversion on the word segmentation information through the embedding layer in the preset neural network model to obtain the target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer;
  • the encoding module 304 is configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain the target semantic encoding features;
  • the extraction module 305 is configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding features to obtain target business keywords.
  • each module in the above apparatus for extracting business keywords corresponds to a step in the above embodiment of the method for extracting business keywords, and their functions and implementation processes are not repeated here.
  • in the embodiments of the present application, business vocabulary matching with the preset business dictionary tree, word segmentation according to the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding features are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the to-be-processed text information, thereby improving the extraction accuracy of business keywords.
  • referring to FIG. 4, another embodiment of the apparatus for extracting business keywords in the embodiments of the present application includes:
  • the matching module 301 is configured to obtain the text information to be processed and, through a preset business dictionary tree, perform business vocabulary matching on the text information to be processed to obtain the text business vocabulary;
  • the word segmentation module 302 is configured to perform word segmentation processing on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
  • the conversion module 303 is configured to perform character and word vector conversion on the word segmentation information through the embedding layer in the preset neural network model to obtain the target word segmentation vector, where the preset neural network model includes an embedding layer and a feature extraction layer;
  • the encoding module 304 is configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain the target semantic encoding feature;
  • the extraction module 305 is configured to perform classification and keyword extraction of the text information to be processed in turn according to the target semantic coding feature to obtain target business keywords;
  • the extraction module 305 specifically includes:
  • the classification and screening unit 3051 is used to perform business vocabulary classification and probability value screening on the target semantic coding feature through the output layer in the preset neural network model to obtain the target classification probability value;
  • the extraction unit 3052 is configured to extract the corresponding business keywords in the text information to be processed based on the target classification probability value, and obtain the target business keywords;
  • the optimization module 306 is configured to obtain the target error value and the corrected business word based on the target business keyword, and optimize the preset neural network model according to the target error value and the corrected business word.
  • the device for extracting business keywords further includes:
  • a calculation module 307 configured to obtain a business vocabulary set, and calculate the word frequency-inverse text frequency index value of each business vocabulary in the business vocabulary set;
  • a sorting module 308 configured to sort the business vocabulary set according to the word frequency-inverse text frequency index value to obtain a business vocabulary sequence;
  • the segmentation module 309 is used to perform character string segmentation processing on the business vocabulary sequence, obtain a word segmentation character set, and create reverse index information of the business vocabulary sequence;
  • the creation module 310 is configured to use the reverse index information as the root node, the word segmentation character set as the leaf node, and create a preset service dictionary tree according to the root node and the leaf node.
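  • optionally, the classification and screening unit 3051 can also be specifically used for:
  • performing business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values;
  • sorting the initial classification probability values in descending order, determining the first-ranked initial classification probability value as the candidate classification probability value, and judging whether the candidate classification probability value is greater than a preset threshold;
  • if the candidate classification probability value is greater than the preset threshold, determining the candidate classification probability value as the target classification probability value;
  • if the candidate classification probability value is less than or equal to the preset threshold, re-acquiring a to-be-processed classification probability value of the text information to be processed;
  • comparing the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.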
  • optionally, the extraction unit 3052 can also be specifically used for:
  • extracting, based on the target classification probability value, the corresponding business keywords in the text information to be processed to obtain initial business keywords;
  • performing splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.
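  • optionally, the conversion module 303 can also be specifically used for:
  • converting the word segmentation information into character and word vectors through the pre-trained character vectors of the embedding layer in the preset neural network model to obtain text character vectors;
  • obtaining the target word vector of the text business vocabulary, and judging whether the target word vector exists among the text character vectors;
  • if the target word vector exists among the text character vectors, splicing the character vectors corresponding to the text character vectors according to the target word vector to obtain the target word segmentation vector.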
  • each module and each unit in the above apparatus for extracting business keywords corresponds to each step in the above embodiment of the above method for extracting business keywords, and their functions and implementation processes will not be repeated here.
  • in this embodiment of the present application, in addition to improving the extraction accuracy of business keywords as described above, the preset neural network model is optimized according to the target error value and the corrected business words, which improves the accuracy of the preset neural network model.
  • FIG. 5 is a schematic structural diagram of a device for extracting business keywords provided by an embodiment of the present application; the device 500 for extracting business keywords may vary greatly depending on configuration or performance, and may include one or more central processing units (CPU) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) that store application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the apparatus 500 for extracting business keywords.
  • the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the business keyword extraction device 500.
  • the device 500 for extracting business keywords may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or, one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and more.
  • the present application also provides a device for extracting business keywords, including a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory, so that the device for extracting business keywords executes the steps in the above-mentioned method for extracting business keywords.
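  • the present application also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium; instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to perform the following steps:
  • obtaining the text information to be processed, and performing business vocabulary matching on the text information to be processed through the preset business dictionary tree to obtain the text business vocabulary;
  • performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
  • performing character and word vector conversion on the word segmentation information through the embedding layer in the preset neural network model to obtain the target word segmentation vector, where the preset neural network model includes an embedding layer and a feature extraction layer;
  • performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain the target semantic encoding feature;
  • performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain the target business keywords.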
  • the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required by at least one function, and the like; the stored data area may store data created according to the use of blockchain nodes, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain, essentially a decentralized database, is a series of data blocks linked to one another by cryptographic methods; each data block contains the information of a batch of network transactions, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solutions of the present application in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

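The present application relates to the field of artificial intelligence and provides a method, apparatus, device, and storage medium for extracting business keywords, which are used to improve the extraction accuracy of business keywords. The method for extracting business keywords includes: performing business vocabulary matching on text information to be processed through a preset business dictionary tree to obtain text business vocabulary; performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information; performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector; performing semantic feature extraction and context semantic encoding on the target word segmentation vector through a feature extraction layer in the preset neural network model to obtain a target semantic encoding feature; and performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords. In addition, the present application also relates to blockchain technology, and the text information to be processed may be stored in a blockchain.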

Description

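Method, apparatus, device, and storage medium for extracting business keywords
This application claims priority to Chinese patent application No. 202011544588.9, entitled "Method, apparatus, device, and storage medium for extracting business keywords" and filed with the China Patent Office on December 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of natural language processing in artificial intelligence, and in particular to a method, apparatus, device, and storage medium for extracting business keywords.
Background
With the development of computer technology, many business matters require keyword extraction, for example the extraction of business keywords in a business domain. Existing approaches to automatically extracting business keywords generally use models for the named entity recognition (NER) task. Named entity recognition identifies entities with specific meanings in text, such as person names and place names. A model for the NER task segments the text to be recognized with a method based on vocabulary characters and then extracts its keywords.
However, the inventors realized that Chinese word segmentation is error-prone, and that such models neither use the vocabulary information of the business domain nor segment the text in combination with lexical boundaries. As a result, the segmentation contains errors, the extracted keywords are not suited to the business domain, and the extraction accuracy of business keywords is low.
Summary
The present application provides a method, apparatus, device, and storage medium for extracting business keywords, which are used to improve the extraction accuracy of business keywords.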
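A first aspect of the present application provides a method for extracting business keywords, including:
obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model includes the embedding layer and a feature extraction layer;
performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.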
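A second aspect of the present application provides a device for extracting business keywords, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model includes the embedding layer and a feature extraction layer;
performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.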
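A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model includes the embedding layer and a feature extraction layer;
performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.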
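A fourth aspect of the present application provides an apparatus for extracting business keywords, including:
a matching module, configured to obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
a word segmentation module, configured to perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
a conversion module, configured to perform character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model includes the embedding layer and a feature extraction layer;
an encoding module, configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
an extraction module, configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.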
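In the technical solution provided by the present application, text information to be processed is obtained, and business vocabulary matching is performed on it through a preset business dictionary tree to obtain text business vocabulary; word segmentation is performed on the text information to be processed according to the text business vocabulary to obtain word segmentation information; character and word vector conversion is performed on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer; semantic feature extraction and context semantic encoding are performed on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature; and classification and keyword extraction are performed in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords. In the embodiments of the present application, business vocabulary matching against the preset business dictionary tree, word segmentation based on the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding feature are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the text information to be processed and thereby the extraction accuracy of business keywords.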
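Brief Description of the Drawings
FIG. 1 is a schematic diagram of one embodiment of the method for extracting business keywords in the embodiments of the present application;
FIG. 2 is a schematic diagram of another embodiment of the method for extracting business keywords in the embodiments of the present application;
FIG. 3 is a schematic diagram of one embodiment of the apparatus for extracting business keywords in the embodiments of the present application;
FIG. 4 is a schematic diagram of another embodiment of the apparatus for extracting business keywords in the embodiments of the present application;
FIG. 5 is a schematic diagram of one embodiment of the device for extracting business keywords in the embodiments of the present application.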
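Detailed Description
The embodiments of the present application provide a method, apparatus, device, and storage medium for extracting business keywords, which improve the extraction accuracy of business keywords.
The terms "first", "second", "third", "fourth", and the like (if any) in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, one embodiment of the method for extracting business keywords in the embodiments of the present application includes:
101. Obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary.
It can be understood that the execution subject of the present application may be an apparatus for extracting business keywords, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with a server as the execution subject.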
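The server calls a preset voice collector to collect voice information input by a user, and performs speech recognition and text conversion on the voice information through a preset speech recognition model to obtain recognized text. The server then detects whether data is missing from the recognized text. If so, it fills in the missing values to obtain processed recognized text and performs a security measurement on the processed recognized text to obtain the text information to be processed; if not, it performs the security measurement directly on the recognized text to obtain the text information to be processed. Alternatively, the server may receive text information entered through a preset interface, thereby obtaining the text information to be processed.
The server may create a target key for the text information to be processed, traverse the preset business dictionary tree with the target key, and match the corresponding text business vocabulary from the preset business dictionary tree. The server may also match the text business vocabulary corresponding to the text information to be processed from the preset business dictionary tree through a preset lowest common ancestor algorithm and a binary lifting (doubling) algorithm.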
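102. Perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information.
The server replaces the characters in the text information to be processed that correspond to the text business vocabulary with a marker token to obtain initial text information, and splits the characters of the initial text information into single characters to obtain the word segmentation information. For example, take the text information to be processed "我名下有个人和保险,交了3年了", where the text business vocabulary is "人和保险" (an insurance product name): the business word "人和保险" is replaced with a token to obtain the initial text information, and the characters of the initial text information are split one by one to obtain the word segmentation information "我/名/下/有/个/人和保险/,/交/了/3/年/了". After the single-character split, the server may perform grammar checking and sensitive-word detection on the resulting tokens and keep those that are grammatical and not sensitive as the word segmentation information; it may also filter the tokens by part of speech according to a preset part-of-speech filtering rule to obtain the word segmentation information.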
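The segmentation in step 102 can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes the matched business words are already available as a plain list, uses longest-match-first lookup in place of an explicit token replacement, and omits the optional grammar, sensitive-word, and part-of-speech checks. The function name `segment_with_business_words` is hypothetical.

```python
def segment_with_business_words(text, business_words):
    """Split text into single characters, keeping matched business words whole."""
    # Longest words first, so the longest business word wins at each position.
    words = sorted((w for w in business_words if w), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is not None:
            tokens.append(match)    # keep the business word as one token
            i += len(match)
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens


tokens = segment_with_business_words("我名下有个人和保险,交了3年了", ["人和保险"])
print("/".join(tokens))
```

Run on the example sentence, the sketch keeps "人和保险" as one token and splits every other character individually, which is the boundary behavior the step describes.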
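103. Perform character and word vector conversion on the word segmentation information through the embedding layer in the preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer.
The preset neural network model includes an embedding layer and a feature extraction layer, and the embedding layer uses pre-trained character vectors. The server obtains the business word vector corresponding to the text business vocabulary and judges whether the pre-trained character vectors in the embedding layer contain a word vector consistent with the business word vector. If so, the word segmentation information is mapped to the preset dimensional space corresponding to the business word vector to obtain the target word segmentation vector; if not, the word segmentation information is mapped to the preset dimensional space corresponding to the character vectors to obtain the target word segmentation vector. For example, if the pre-trained vectors in the embedding layer contain the business word vector for "人和保险", the word segmentation information is mapped to the dimensional space corresponding to the business word vector, giving the target word segmentation vector for "人和保险"; if they do not, the word segmentation information is mapped to the dimensional space corresponding to the character vectors, giving the target word segmentation vectors for "人", "和", "保", and "险".
In another embodiment, the server may also map the word segmentation information to a preset dimensional space through the pre-trained character vectors in the embedding layer to obtain character and word vectors, which include character vectors and word vectors. The server obtains the text business word vector corresponding to the text business vocabulary in the word segmentation information, calculates the cosine distance value between the text business word vector and the character and word vectors, and judges whether each cosine distance value is greater than a preset target value. If so, it determines that the character and word vectors contain a vector corresponding to the text business word vector and takes the character and word vectors as the target word segmentation vector; if not, it determines that no such vector exists and replaces the vector at the position of the text business vocabulary with the text business word vector to obtain the target word segmentation vector.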
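The cosine-based check in this alternative embodiment could look roughly like the sketch below. It treats the "cosine distance value" as a cosine similarity, since a larger value is taken to mean a match, and the names `cosine_similarity` and `align_business_vector`, the plain-list vector representation, and the default target of 0.9 are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_business_vector(token_vectors, position, business_vector, target=0.9):
    """Keep the embedded vector at `position` if it is close enough to the
    business word vector; otherwise replace it with the business word vector."""
    aligned = [list(vec) for vec in token_vectors]
    if cosine_similarity(aligned[position], business_vector) <= target:
        aligned[position] = list(business_vector)
    return aligned
```

The function returns a new list, so the original embedded sequence stays available for comparison.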
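104. Perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature.
The feature extraction layer may be a bi-directional long short-term memory (BiLSTM) network layer, a convolutional neural network (CNN), and/or a Transformer model, among others; the feature extraction layer is generic and its network structure is not restricted. After obtaining the target word segmentation vector, the server inputs it into the preset neural network model. Through the feature extraction layer in the preset neural network model, the server performs semantic feature analysis and semantic feature classification on the target word segmentation vector to obtain first information, and performs sememe analysis on the target word segmentation vector to obtain second information. The first information and the second information may be fused according to a preset weight ratio or an attention mechanism to obtain comprehensive information. The server extracts the features of the target word segmentation vector according to the comprehensive information to obtain initial semantic features, reduces the dimensionality of the initial semantic features to obtain candidate semantic features, extracts the context vector of the candidate semantic features, and encodes the candidate semantic features based on the context vector through a preset semantic encoding model to obtain the target semantic encoding feature.
105. Perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
The server performs business vocabulary classification on the target semantic encoding feature through a classifier in the preset neural network model to obtain multiple classification values, sorts the classification values in descending order, determines the characters in the text information to be processed that correspond to the first-ranked classification value as target characters, extracts the target characters from the text information to be processed, and combines the target characters in the order in which they appear in the text information to be processed to obtain the target business keywords. The preset neural network model may include one or more classifiers. If the output layer contains multiple classifiers, they may be connected in parallel, that is, they share the same input, or they may be connected according to a preset connection scheme, that is, the input of one classifier may be the output of the previous classifier.
In the embodiments of the present application, business vocabulary matching against the preset business dictionary tree, word segmentation based on the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding feature are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the text information to be processed and thereby the extraction accuracy of business keywords.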
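Referring to FIG. 2, another embodiment of the method for extracting business keywords in the embodiments of the present application includes:
201. Obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary.
The execution of step 201 is similar to that of step 101 above and is not repeated here.
Specifically, before obtaining the text information to be processed and performing business vocabulary matching on it through the preset business dictionary tree to obtain the text business vocabulary, the server obtains a business vocabulary set and calculates the term frequency-inverse document frequency value of each business word in the business vocabulary set; sorts the business vocabulary set according to the term frequency-inverse document frequency values to obtain a business vocabulary sequence; performs string segmentation on the business vocabulary sequence to obtain a word segmentation character set, and creates reverse index information of the business vocabulary sequence; and takes the reverse index information as the root node and the word segmentation character set as leaf nodes, creating the preset business dictionary tree according to the root node and the leaf nodes.
The server crawls a business-domain vocabulary set from web pages, extracts a business vocabulary list from a preset database, and merges and de-duplicates the business-domain vocabulary and the business words in the business vocabulary list to obtain the business vocabulary set. For each business word in the business vocabulary set, the server calculates a term frequency-inverse document frequency (TF-IDF) value based on preset query corpus texts and preset articles, which are the texts and articles corresponding to the respective business domains and/or business requirements. The business words in the business vocabulary set are then sorted in descending order of their TF-IDF values to obtain the business vocabulary sequence, which improves retrieval efficiency.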
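As a rough illustration of the TF-IDF ranking described above, the sketch below scores each business word against a small corpus and sorts the vocabulary in descending order. The source does not specify how per-document TF-IDF values are aggregated into one value per word, so the corpus-level term frequency, the smoothed IDF formula, and the function name `tfidf_rank` are assumptions.

```python
import math


def tfidf_rank(vocabulary, documents):
    """Return the vocabulary sorted by descending TF-IDF score, plus the scores."""
    n_docs = len(documents)
    total_chars = sum(len(d) for d in documents) or 1
    scores = {}
    for word in vocabulary:
        # Term frequency over the whole corpus, normalised by corpus length.
        tf = sum(d.count(word) for d in documents) / total_chars
        # Smoothed inverse document frequency (an assumed variant).
        df = sum(1 for d in documents if word in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    ranked = sorted(vocabulary, key=lambda w: scores[w], reverse=True)
    return ranked, scores


docs = ["人和保险怎么买", "人和保险多少钱", "健康险"]
ranked, scores = tfidf_rank(["健利宝", "人和保险", "健康险"], docs)
print(ranked)
```

Words that never occur in the corpus receive a score of zero and fall to the end of the sequence.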
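The server groups the business words in the business vocabulary sequence by their first character to obtain grouped business words, and performs string segmentation on the grouped business words to obtain the word segmentation character set. It creates the reverse index information of the business vocabulary sequence through a preset inverted index algorithm. The reverse index information is taken as the root node, the characters shared by the grouped business words are taken as parent nodes, and the segmented characters of the remaining positions are taken as leaf nodes; the dictionary tree is created according to the root node, parent nodes, and leaf nodes, giving the preset business dictionary tree. For example, take the insurance products "人和保险", "健康险", and "健利宝". The first character of "人和保险" is "人", the first character of "健康险" is "健", and the first character of "健利宝" is "健", so "人和保险" forms one group and "健康险" and "健利宝" form another. String segmentation of "人和保险", "健康险", and "健利宝" gives the word segmentation character set corresponding to "人/和/保/险", "健/康/险", and "健/利/宝". The reverse index information of "人和保险", "健康险", and "健利宝" is taken as the root node of the preset business dictionary tree, and "人" and "健" are taken as parent nodes, which are the nodes one level below the root node. "和", "保", and "险" are taken in turn as the leaf nodes under the parent node "人"; "康" and "险" are taken in turn as the leaf nodes of the first branch of the parent node "健"; and "利" and "宝" are taken in turn as the leaf nodes of the second branch of the parent node "健", giving the preset business dictionary tree.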
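The tree described above can be sketched as an ordinary character trie. The following is a simplified illustration under stated assumptions: the reverse index information held at the root is reduced to a dictionary from first character to words, and the class names `TrieNode` and `BusinessTrie` and the longest-match lookup are hypothetical choices, not details given in the application.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if a business word ends at this node


class BusinessTrie:
    def __init__(self, words):
        self.root = TrieNode()
        self.index = {}       # first character -> words (stand-in for the reverse index)
        for word in words:
            self.insert(word)

    def insert(self, word):
        if not word:
            return
        self.index.setdefault(word[0], []).append(word)
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Longest business word beginning at text[start], or None."""
        node, end = self.root, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return text[start:end] if end is not None else None

    def find_all(self, text):
        """Scan the text left to right and collect matched business words."""
        found, i = [], 0
        while i < len(text):
            match = self.longest_match(text, i)
            if match:
                found.append(match)
                i += len(match)
            else:
                i += 1
        return found


trie = BusinessTrie(["人和保险", "健康险", "健利宝"])
print(trie.find_all("我名下有个人和保险"))
```

With the three example products, the root has the two children "人" and "健", and "健" branches into "康" and "利", mirroring the two groups in the example.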
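202. Perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information.
The execution of step 202 is similar to that of step 102 above and is not repeated here.
203. Perform character and word vector conversion on the word segmentation information through the embedding layer in the preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer.
Specifically, the server converts the word segmentation information into character and word vectors through the pre-trained character vectors of the embedding layer in the preset neural network model to obtain text character vectors; obtains the target word vector of the text business vocabulary and judges whether the target word vector exists among the text character vectors; and, if the target word vector exists among the text character vectors, splices the character vectors corresponding to the text character vectors according to the target word vector to obtain the target word segmentation vector.
For example, take the word segmentation information "我/名/下/有/个/人和保险/". The server uses the pre-trained character vectors of the embedding layer in the preset neural network model, which are character vectors already pre-trained by others and cover several million characters and words, to convert the word segmentation information into text character vectors. The target word vector of the text business vocabulary is the word vector corresponding to "人和保险". The server judges whether the text character vectors contain the word vector corresponding to "人和保险" (the target word vector). If so (that is, the text character vectors are the vectors corresponding to "我/名/下/有/个/人和保险/"), the character vectors corresponding to "人/和/保险/" are spliced according to the word vector corresponding to "人和保险" to obtain the target word segmentation vector for "我/名/下/有/个/人和保险/"; if not (that is, the text character vectors are the vectors corresponding to "我/名/下/有/个/人/和/保险/"), the text character vectors are taken as the target word segmentation vector.
In another embodiment, the server may obtain a business vocabulary corpus in advance, obtain the representation information and context features of the business vocabulary corpus, and train character vectors through a preset Skip-Gram (continuous skip-gram) model based on the business vocabulary corpus, its representation information, and its context features to obtain the pre-trained character vectors.
204. Perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature.
Specifically, the server extracts the context vector and semantic features of the target word segmentation vector through the feature extraction layer, and multiplies the semantic features as matrices to obtain an initial semantic encoding feature; it then encodes the initial semantic encoding feature according to the context vector to obtain the target semantic encoding feature.
The server extracts the context vector of the target word segmentation vector through the BiLSTM layer in the feature extraction layer, and performs syntactic analysis and semantic classification of semantic features on the target word segmentation vector through the convolutional neural network in the feature extraction layer to obtain the semantic features. The semantic features are converted into semantic matrices, and the semantic matrices are multiplied together to obtain the initial semantic encoding feature, which removes redundant semantic features. Through the fusion network and the attention mechanism network in the feature extraction layer, the context vectors are fused from top to bottom to obtain an initial vector. Through the Transformer model in the feature extraction layer, the server encodes the initial semantic encoding feature according to the initial vector to obtain the target semantic encoding feature.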
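205. Perform business vocabulary classification and probability value screening on the target semantic encoding feature through the output layer in the preset neural network model to obtain a target classification probability value.
Specifically, the server performs business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values; sorts the initial classification probability values in descending order, determines the first-ranked initial classification probability value as the candidate classification probability value, and judges whether the candidate classification probability value is greater than a preset threshold; if the candidate classification probability value is greater than the preset threshold, determines it as the target classification probability value; if the candidate classification probability value is less than or equal to the preset threshold, re-acquires a to-be-processed classification probability value of the text information to be processed; and compares the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.
For example, business vocabulary classification is performed on the target semantic encoding feature through the output layer in the preset neural network model to obtain the initial classification probability values A1, A2, and A3. Sorting A1, A2, and A3 in descending order gives A2, A1, A3, so A2 is the candidate classification probability value. If A2 is greater than the preset threshold, A2 is determined as the target classification probability value. If A2 is less than or equal to the preset threshold, the to-be-processed semantic encoding feature of the text information to be processed is re-acquired according to the execution of steps 101-104 above, and the to-be-processed classification probability value B of that feature is obtained. If B is greater than the preset threshold, B is determined as the target classification probability value; if B is less than or equal to the preset threshold, the larger of B and A2 may be determined as the target classification probability value. Alternatively, the to-be-processed semantic encoding feature may be re-acquired iteratively, together with its to-be-processed classification probability value C, until C is greater than the preset threshold, which gives the target classification probability value.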
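The screening logic in this example can be sketched as a small function. It is illustrative only: `recompute` stands in for re-running steps 101-104 and the output layer, and the retry limit `max_retries` is an assumption added so that the iterative variant always terminates; neither name comes from the application.

```python
def screen_probability(initial_probs, threshold, recompute, max_retries=3):
    """Pick the target classification probability value.

    initial_probs: initial classification probability values (e.g. A1, A2, A3).
    recompute:     callable returning a freshly computed probability (B, C, ...).
    """
    candidate = max(initial_probs)   # first-ranked value after a descending sort
    if candidate > threshold:
        return candidate
    best = candidate
    for _ in range(max_retries):
        pending = recompute()        # re-acquired to-be-processed value
        if pending > threshold:
            return pending
        best = max(best, pending)    # fall back to the largest value seen
    return best
```

With `max_retries=1` this reproduces the "larger of B and A2" variant; larger limits approximate the iterative variant.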
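The preset neural network model further includes an output layer, which is used to perform business vocabulary classification and probability value screening on the target semantic encoding feature. The output layer may include one or more classifiers. If the output layer contains multiple classifiers, they may be connected in parallel, that is, they share the same input, or they may be connected according to a preset connection scheme, that is, the input of one classifier may be the output of the previous classifier.
206. Extract the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain target business keywords.
Specifically, the server extracts the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain initial business keywords, and performs splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.
The server labels each character in the text information to be processed according to the target classification probability value, and either extracts the labeled characters or deletes the unlabeled characters to obtain the initial business keywords. For example, suppose the target business keyword is an insurance name and the text information to be processed is "华彩人生如何". The server labels each character according to the target classification probability value, giving "保险名 保险名 保险名 保险名 o o" (where "保险名" means insurance name), and then extracts the characters labeled "保险名" or deletes the characters labeled "o", giving the initial business keywords "华", "彩", "人", "生". The server then splices the initial business keywords according to a preset splicing rule to obtain a spliced word, filters the spliced word through a preset part-of-speech filtering rule to obtain a candidate business keyword, and matches the candidate business keyword against the preset dictionary tree. If a corresponding business word is matched in the preset dictionary tree, the candidate business keyword conforms to a business word and is determined as the target business keyword. If no corresponding business word is matched in the preset dictionary tree, the candidate business keyword does not conform to a business word or is not in the preset dictionary tree; the server may then send the text information to be processed and the candidate business keyword to a preset review terminal, where a reviewer or a preset model extracts the business vocabulary from the text information to be processed to obtain the target business keyword, which improves the accuracy of the target business keyword.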
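The labeling-and-splicing flow above can be sketched as follows. This is a simplified illustration: per-character labels are assumed to be given, adjacent labeled characters are spliced into one span, the part-of-speech filter is omitted, and a plain set of known business words stands in for the dictionary tree. The function name `extract_keywords` is hypothetical.

```python
def extract_keywords(text, labels, target_label, known_words):
    """Splice adjacent labeled characters and check them against known words.

    Returns (confirmed, needs_review): spans found among the known business
    words, and spans that would be sent on for review.
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    spans, current = [], []
    for ch, label in zip(text, labels):
        if label == target_label:
            current.append(ch)               # extend the current labeled span
        elif current:
            spans.append("".join(current))   # an unlabeled character closes the span
            current = []
    if current:
        spans.append("".join(current))
    confirmed = [s for s in spans if s in known_words]
    needs_review = [s for s in spans if s not in known_words]
    return confirmed, needs_review


labels = ["保险名"] * 4 + ["o", "o"]
confirmed, needs_review = extract_keywords("华彩人生如何", labels, "保险名", {"华彩人生"})
print(confirmed, needs_review)
```

Spans that are not found among the known words are returned separately, corresponding to the candidates that the description forwards to a review terminal.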
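207. Obtain a target error value and corrected business words based on the target business keywords, and optimize the preset neural network model according to the target error value and the corrected business words.
The server obtains initial historical text information and the corresponding initial historical business words from a preset database, matches the target historical text information corresponding to the text information to be processed from the initial historical text information, and obtains the target historical business word corresponding to the target historical text information. A first error value is assigned directly from the text matching result: if the target historical business word is identical to the target business keyword, the first error value is 0; if they differ, the first error value is 1. The server also obtains a second error value of the target business keyword based on manual review, and calculates the sum, or the weighted sum, of the first error value and the second error value to obtain the target error value. The server judges whether the target error value is greater than a preset correction threshold. If so, the server obtains a manually corrected business word; if not, an empty string is generated as the corrected business word of the target business keyword. The target error value and the corrected business word of the target business keyword are stored in a preset storage space, and the network layers, network structure, and model parameters of the preset neural network model are adjusted and optimized according to the target error value and the corrected business words, which improves the accuracy of the preset neural network model.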
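The error computation in step 207 can be sketched as follows, assuming the manually reviewed second error value is supplied as a number. The equal default weights and the function names `target_error` and `correction_for` are assumptions; the description only states that a sum or weighted sum is used.

```python
def target_error(predicted, historical, review_error, w_match=0.5, w_review=0.5):
    """Weighted sum of the exact-match error and the manual-review error."""
    first_error = 0.0 if predicted == historical else 1.0  # 0 if identical, else 1
    return w_match * first_error + w_review * review_error


def correction_for(error, threshold, manual_correction):
    """Return the manual correction above the threshold, else an empty string."""
    return manual_correction if error > threshold else ""
```

The pair of values returned by these two helpers corresponds to what the description stores and later uses to adjust the model.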
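In another embodiment, the server may obtain initial historical text information and the corresponding initial historical business words from the preset database, match the target historical text information corresponding to the text information to be processed from the initial historical text information, obtain the target historical business word corresponding to the target historical text information, calculate the similarity between the target historical business word and the target business keyword, and take the absolute value of the difference between the similarity and 1 as the first error value of the target business keyword.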
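A possible reading of this similarity-based first error value is sketched below. The description does not name a similarity measure, so the use of `difflib.SequenceMatcher` from the Python standard library is an assumption, as is the function name `similarity_error`.

```python
from difflib import SequenceMatcher


def similarity_error(predicted, historical):
    """First error value: absolute difference between the similarity and 1."""
    similarity = SequenceMatcher(None, predicted, historical).ratio()
    return abs(similarity - 1.0)


print(similarity_error("华彩人生", "华彩人生"))  # identical strings give 0.0
```

Unlike the exact-match rule, this variant gives partial credit when the extracted keyword overlaps the historical business word.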
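In the embodiments of the present application, business vocabulary matching against the preset business dictionary tree, word segmentation based on the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, business vocabulary classification and probability value screening based on the output layer and the target semantic encoding feature, and business keyword extraction based on the target classification probability value are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the text information to be processed and thereby the extraction accuracy of business keywords. In addition, the preset neural network model is optimized according to the target error value and the corrected business words, which improves the accuracy of the preset neural network model.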
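The method for extracting business keywords in the embodiments of the present application has been described above; the apparatus for extracting business keywords in the embodiments of the present application is described below. Referring to FIG. 3, one embodiment of the apparatus for extracting business keywords in the embodiments of the present application includes:
a matching module 301, configured to obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
a word segmentation module 302, configured to perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
a conversion module 303, configured to perform character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer;
an encoding module 304, configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
an extraction module 305, configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
The functions of the modules in the above apparatus for extracting business keywords correspond to the steps in the above method embodiments, and their functions and implementation processes are not repeated here.
In the embodiments of the present application, business vocabulary matching against the preset business dictionary tree, word segmentation based on the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, and classification and keyword extraction based on the target semantic encoding feature are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the text information to be processed and thereby the extraction accuracy of business keywords.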
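Referring to FIG. 4, another embodiment of the apparatus for extracting business keywords in the embodiments of the present application includes:
a matching module 301, configured to obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
a word segmentation module 302, configured to perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
a conversion module 303, configured to perform character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer;
an encoding module 304, configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
an extraction module 305, configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords;
wherein the extraction module 305 specifically includes:
a classification and screening unit 3051, configured to perform business vocabulary classification and probability value screening on the target semantic encoding feature through the output layer in the preset neural network model to obtain a target classification probability value;
an extraction unit 3052, configured to extract the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the target business keywords;
an optimization module 306, configured to obtain a target error value and corrected business words based on the target business keywords, and optimize the preset neural network model according to the target error value and the corrected business words.
Optionally, the apparatus for extracting business keywords further includes:
a calculation module 307, configured to obtain a business vocabulary set and calculate the term frequency-inverse document frequency value of each business word in the business vocabulary set;
a sorting module 308, configured to sort the business vocabulary set according to the term frequency-inverse document frequency values to obtain a business vocabulary sequence;
a segmentation module 309, configured to perform string segmentation on the business vocabulary sequence to obtain a word segmentation character set, and create reverse index information of the business vocabulary sequence;
a creation module 310, configured to take the reverse index information as the root node and the word segmentation character set as leaf nodes, and create the preset business dictionary tree according to the root node and the leaf nodes.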
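Optionally, the classification and screening unit 3051 may be specifically configured to:
perform business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values;
sort the initial classification probability values in descending order, determine the first-ranked initial classification probability value as the candidate classification probability value, and judge whether the candidate classification probability value is greater than a preset threshold;
if the candidate classification probability value is greater than the preset threshold, determine the candidate classification probability value as the target classification probability value;
if the candidate classification probability value is less than or equal to the preset threshold, re-acquire a to-be-processed classification probability value of the text information to be processed;
compare the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.
Optionally, the extraction unit 3052 may be specifically configured to:
extract the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain initial business keywords;
perform splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.
Optionally, the conversion module 303 may be specifically configured to:
convert the word segmentation information into character and word vectors through the pre-trained character vectors of the embedding layer in the preset neural network model to obtain text character vectors;
obtain the target word vector of the text business vocabulary, and judge whether the target word vector exists among the text character vectors;
if the target word vector exists among the text character vectors, splice the character vectors corresponding to the text character vectors according to the target word vector to obtain the target word segmentation vector.
The functions of the modules and units in the above apparatus for extracting business keywords correspond to the steps in the above method embodiments, and their functions and implementation processes are not repeated here.
In the embodiments of the present application, business vocabulary matching against the preset business dictionary tree, word segmentation based on the text business vocabulary, semantic feature extraction and context semantic encoding of the target word segmentation vector, business vocabulary classification and probability value screening based on the output layer and the target semantic encoding feature, and business keyword extraction based on the target classification probability value are combined, so that the text information to be processed is segmented along the lexical boundaries of the business vocabulary. This improves the accuracy of business vocabulary matching for the text information to be processed and thereby the extraction accuracy of business keywords. In addition, the preset neural network model is optimized according to the target error value and the corrected business words, which improves the accuracy of the preset neural network model.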
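FIG. 3 and FIG. 4 above describe the apparatus for extracting business keywords in the embodiments of the present application in detail from the perspective of modular functional entities; the device for extracting business keywords in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a device for extracting business keywords provided by an embodiment of the present application. The device 500 for extracting business keywords may vary greatly depending on configuration or performance, and may include one or more central processing units (CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the device 500 for extracting business keywords. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute, on the device 500 for extracting business keywords, the series of instruction operations in the storage medium 530.
The device 500 for extracting business keywords may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art can understand that the device structure shown in FIG. 5 does not limit the device for extracting business keywords, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a device for extracting business keywords, including a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory, so that the device for extracting business keywords executes the steps in the above method for extracting business keywords.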
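The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to perform the following steps:
obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, where the preset neural network model includes the embedding layer and a feature extraction layer;
performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.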

Claims (20)

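  1. A method for extracting business keywords, wherein the method for extracting business keywords comprises:
    obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
    performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
    performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model comprises the embedding layer and a feature extraction layer;
    performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
    performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
  2. The method for extracting business keywords according to claim 1, wherein before the obtaining text information to be processed and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary, the method further comprises:
    obtaining a business vocabulary set, and calculating a term frequency-inverse document frequency value of each business word in the business vocabulary set;
    sorting the business vocabulary set according to the term frequency-inverse document frequency values to obtain a business vocabulary sequence;
    performing string segmentation on the business vocabulary sequence to obtain a word segmentation character set, and creating reverse index information of the business vocabulary sequence;
    taking the reverse index information as a root node and the word segmentation character set as leaf nodes, and creating the preset business dictionary tree according to the root node and the leaf nodes.
  3. The method for extracting business keywords according to claim 1, wherein the performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords comprises:
    performing business vocabulary classification and probability value screening on the target semantic encoding feature through an output layer in the preset neural network model to obtain a target classification probability value;
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the target business keywords.
  4. The method for extracting business keywords according to claim 3, wherein the performing business vocabulary classification and probability value screening on the target semantic encoding feature through an output layer in the preset neural network model to obtain a target classification probability value comprises:
    performing business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values;
    sorting the initial classification probability values in descending order, determining the first-ranked initial classification probability value as a candidate classification probability value, and judging whether the candidate classification probability value is greater than a preset threshold;
    if the candidate classification probability value is greater than the preset threshold, determining the candidate classification probability value as the target classification probability value;
    if the candidate classification probability value is less than or equal to the preset threshold, re-acquiring a to-be-processed classification probability value of the text information to be processed;
    comparing the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.
  5. The method for extracting business keywords according to claim 3, wherein the extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the target business keywords comprises:
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain initial business keywords;
    performing splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.
  6. The method for extracting business keywords according to claim 1, wherein the performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector comprises:
    converting the word segmentation information into character and word vectors through pre-trained character vectors of the embedding layer in the preset neural network model to obtain text character vectors;
    obtaining a target word vector of the text business vocabulary, and judging whether the target word vector exists among the text character vectors;
    if the target word vector exists among the text character vectors, splicing the character vectors corresponding to the text character vectors according to the target word vector to obtain the target word segmentation vector.
  7. The method for extracting business keywords according to any one of claims 1-6, wherein after the performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords, the method further comprises:
    obtaining a target error value and corrected business words based on the target business keywords, and optimizing the preset neural network model according to the target error value and the corrected business words.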
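  8. A device for extracting business keywords, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
    performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
    performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model comprises the embedding layer and a feature extraction layer;
    performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
    performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
  9. The device for extracting business keywords according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    obtaining a business vocabulary set, and calculating a term frequency-inverse document frequency value of each business word in the business vocabulary set;
    sorting the business vocabulary set according to the term frequency-inverse document frequency values to obtain a business vocabulary sequence;
    performing string segmentation on the business vocabulary sequence to obtain a word segmentation character set, and creating reverse index information of the business vocabulary sequence;
    taking the reverse index information as a root node and the word segmentation character set as leaf nodes, and creating the preset business dictionary tree according to the root node and the leaf nodes.
  10. The device for extracting business keywords according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    performing business vocabulary classification and probability value screening on the target semantic encoding feature through an output layer in the preset neural network model to obtain a target classification probability value;
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the target business keywords.
  11. The device for extracting business keywords according to claim 10, wherein the processor further implements the following steps when executing the computer program:
    performing business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values;
    sorting the initial classification probability values in descending order, determining the first-ranked initial classification probability value as a candidate classification probability value, and judging whether the candidate classification probability value is greater than a preset threshold;
    if the candidate classification probability value is greater than the preset threshold, determining the candidate classification probability value as the target classification probability value;
    if the candidate classification probability value is less than or equal to the preset threshold, re-acquiring a to-be-processed classification probability value of the text information to be processed;
    comparing the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.
  12. The device for extracting business keywords according to claim 10, wherein the processor further implements the following steps when executing the computer program:
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain initial business keywords;
    performing splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.
  13. The device for extracting business keywords according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    converting the word segmentation information into character and word vectors through pre-trained character vectors of the embedding layer in the preset neural network model to obtain text character vectors;
    obtaining a target word vector of the text business vocabulary, and judging whether the target word vector exists among the text character vectors;
    if the target word vector exists among the text character vectors, splicing the character vectors corresponding to the text character vectors according to the target word vector to obtain the target word segmentation vector.
  14. The device for extracting business keywords according to any one of claims 8-13, wherein the processor further implements the following steps when executing the computer program:
    obtaining a target error value and corrected business words based on the target business keywords, and optimizing the preset neural network model according to the target error value and the corrected business words.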
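  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    obtaining text information to be processed, and performing business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
    performing word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
    performing character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model comprises the embedding layer and a feature extraction layer;
    performing semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
    performing classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    obtaining a business vocabulary set, and calculating a term frequency-inverse document frequency value of each business word in the business vocabulary set;
    sorting the business vocabulary set according to the term frequency-inverse document frequency values to obtain a business vocabulary sequence;
    performing string segmentation on the business vocabulary sequence to obtain a word segmentation character set, and creating reverse index information of the business vocabulary sequence;
    taking the reverse index information as a root node and the word segmentation character set as leaf nodes, and creating the preset business dictionary tree according to the root node and the leaf nodes.
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    performing business vocabulary classification and probability value screening on the target semantic encoding feature through an output layer in the preset neural network model to obtain a target classification probability value;
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain the target business keywords.
  18. The computer-readable storage medium according to claim 17, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    performing business vocabulary classification on the target semantic encoding feature through the output layer in the preset neural network model to obtain initial classification probability values;
    sorting the initial classification probability values in descending order, determining the first-ranked initial classification probability value as a candidate classification probability value, and judging whether the candidate classification probability value is greater than a preset threshold;
    if the candidate classification probability value is greater than the preset threshold, determining the candidate classification probability value as the target classification probability value;
    if the candidate classification probability value is less than or equal to the preset threshold, re-acquiring a to-be-processed classification probability value of the text information to be processed;
    comparing the to-be-processed classification probability value with the preset threshold to obtain the target classification probability value.
  19. The computer-readable storage medium according to claim 17, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    extracting the corresponding business keywords in the text information to be processed based on the target classification probability value to obtain initial business keywords;
    performing splicing, part-of-speech filtering, and dictionary tree matching on the initial business keywords in sequence to obtain the target business keywords.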
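  20. An apparatus for extracting business keywords, wherein the apparatus for extracting business keywords comprises:
    a matching module, configured to obtain text information to be processed, and perform business vocabulary matching on the text information to be processed through a preset business dictionary tree to obtain text business vocabulary;
    a word segmentation module, configured to perform word segmentation on the text information to be processed according to the text business vocabulary to obtain word segmentation information;
    a conversion module, configured to perform character and word vector conversion on the word segmentation information through an embedding layer in a preset neural network model to obtain a target word segmentation vector, wherein the preset neural network model comprises the embedding layer and a feature extraction layer;
    an encoding module, configured to perform semantic feature extraction and context semantic encoding on the target word segmentation vector through the feature extraction layer to obtain a target semantic encoding feature;
    an extraction module, configured to perform classification and keyword extraction in sequence on the text information to be processed according to the target semantic encoding feature to obtain target business keywords.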
PCT/CN2021/109145 2020-12-23 2021-07-29 Method, apparatus, device, and storage medium for extracting business keywords WO2022134575A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011544588.9 2020-12-23
CN202011544588.9A CN112632292A (zh) 2020-12-23 2020-12-23 Method, apparatus, device, and storage medium for extracting business keywords

Publications (1)

Publication Number Publication Date
WO2022134575A1 true WO2022134575A1 (zh) 2022-06-30

Family

ID=75322068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109145 WO2022134575A1 (zh) 2020-12-23 2021-07-29 Method, apparatus, device, and storage medium for extracting business keywords

Country Status (2)

Country Link
CN (1) CN112632292A (zh)
WO (1) WO2022134575A1 (zh)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114911917A (zh) * 2022-07-13 2022-08-16 树根互联股份有限公司 资产元信息搜索方法、装置、计算机设备及可读存储介质
CN115168594A (zh) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 警情信息处理方法和装置、电子设备和存储介质
CN115496039A (zh) * 2022-11-17 2022-12-20 荣耀终端有限公司 一种词语提取方法及计算机设备
CN115514579A (zh) * 2022-11-09 2022-12-23 北京连星科技有限公司 基于IPv6地址映射流标签实现业务标识的方法及系统
CN115827815A (zh) * 2022-11-17 2023-03-21 西安电子科技大学广州研究院 基于小样本学习的关键词提取方法及装置
CN115904855A (zh) * 2023-03-02 2023-04-04 上海合见工业软件集团有限公司 基于信号动态追踪确定目标驱动源码的系统
CN115951883A (zh) * 2023-03-15 2023-04-11 日照市德衡信息技术有限公司 分布式微服务架构的服务组件管理系统及其方法
CN116127960A (zh) * 2023-04-17 2023-05-16 广东粤港澳大湾区国家纳米科技创新研究院 信息抽取方法、装置、存储介质及计算机设备
CN116484856A (zh) * 2023-02-15 2023-07-25 北京数美时代科技有限公司 一种文本的关键词提取方法、装置、电子设备及存储介质
CN116580849A (zh) * 2023-05-30 2023-08-11 杭州医初科技有限公司 医疗数据的采集分析系统及其方法
CN116956897A (zh) * 2023-09-20 2023-10-27 湖南财信数字科技有限公司 隐性广告处理方法、装置、计算机设备及存储介质
CN116978384A (zh) * 2023-09-25 2023-10-31 成都市青羊大数据有限责任公司 一种公安一体化大数据管理系统
CN116976320A (zh) * 2023-09-22 2023-10-31 湖南财信数字科技有限公司 机构简称提取方法、装置、计算机设备及存储介质
CN117149957A (zh) * 2023-11-01 2023-12-01 腾讯科技(深圳)有限公司 文本处理方法、装置、设备及介质
CN118035456A (zh) * 2024-04-11 2024-05-14 江西微博科技有限公司 基于大数据的电子材料数据共享管理系统

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632292A (zh) * 2020-12-23 2021-04-09 深圳壹账通智能科技有限公司 业务关键词的提取方法、装置、设备及存储介质
CN113051291A (zh) * 2021-04-16 2021-06-29 平安国际智慧城市科技股份有限公司 工单信息的处理方法、装置、设备及存储介质
CN113220838A (zh) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 确定关键信息的方法、装置、电子设备和存储介质
CN113486659B (zh) * 2021-05-25 2024-03-15 平安科技(深圳)有限公司 文本匹配方法、装置、计算机设备及存储介质
CN113326350B (zh) * 2021-05-31 2023-05-26 江汉大学 基于远程学习的关键词提取方法、系统、设备及存储介质
CN113377965B (zh) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 感知文本关键词的方法及相关装置
CN113361644B (zh) * 2021-07-03 2024-05-14 上海理想信息产业(集团)有限公司 模型训练方法、电信业务特征信息提取方法、装置及设备
CN113553851A (zh) * 2021-07-15 2021-10-26 杭州网易云音乐科技有限公司 关键词的确定方法、装置、存储介质和计算设备
CN113627139A (zh) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 企业申报表生成方法、装置、设备及存储介质
CN113626671A (zh) * 2021-08-12 2021-11-09 平安国际智慧城市科技股份有限公司 基于字符匹配的数据分类方法、装置、设备以及存储介质
CN113870478A (zh) * 2021-09-29 2021-12-31 平安银行股份有限公司 快速取号方法、装置、电子设备及存储介质
CN113889281B (zh) * 2021-11-17 2024-05-03 华美浩联医疗科技(北京)有限公司 一种中文医疗智能实体识别方法、装置及计算机设备
CN114817526B (zh) * 2022-02-21 2024-03-29 华院计算技术(上海)股份有限公司 文本分类方法及装置、存储介质、终端
CN114881017A (zh) * 2022-04-25 2022-08-09 南京烽火星空通信发展有限公司 一种自适应动态分词方法
CN115062604A (zh) * 2022-05-31 2022-09-16 联想(北京)有限公司 一种信息处理方法和计算机可读存储介质
CN114743554A (zh) * 2022-06-09 2022-07-12 武汉工商学院 基于物联网的智能家居交互方法及装置
CN115904482B (zh) * 2022-11-30 2023-09-26 杭州巨灵兽智能科技有限公司 接口文档生成方法、装置、设备及存储介质
CN115906768B (zh) * 2023-01-04 2023-05-05 深圳市迪博企业风险管理技术有限公司 企业信息化数据合规性评估方法、系统和可读存储介质
CN117558270B (zh) * 2024-01-11 2024-04-02 腾讯科技(深圳)有限公司 语音识别方法、装置、关键词检测模型的训练方法和装置
CN117608650B (zh) * 2024-01-15 2024-04-09 钱塘科技创新中心 业务流程图生成方法、处理设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
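CN110990525A (zh) * 2019-11-15 2020-04-10 华融融通(北京)科技有限公司 Natural language processing-based public opinion information extraction and knowledge base generation method
US20200151222A1 (en) * 2018-11-09 2020-05-14 Accenture Global Solutions Limited Dark web content analysis and identification
CN112100344A (zh) * 2020-08-18 2020-12-18 淮阴工学院 Knowledge graph-based question answering method for financial domain knowledge
CN112632292A (zh) * 2020-12-23 2021-04-09 深圳壹账通智能科技有限公司 Business keyword extraction method, apparatus, device, and storage medium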

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
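US20200151222A1 (en) * 2018-11-09 2020-05-14 Accenture Global Solutions Limited Dark web content analysis and identification
CN110990525A (zh) * 2019-11-15 2020-04-10 华融融通(北京)科技有限公司 Natural language processing-based public opinion information extraction and knowledge base generation method
CN112100344A (zh) * 2020-08-18 2020-12-18 淮阴工学院 Knowledge graph-based question answering method for financial domain knowledge
CN112632292A (zh) * 2020-12-23 2021-04-09 深圳壹账通智能科技有限公司 Business keyword extraction method, apparatus, device, and storage medium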

Cited By (25)

Publication number Priority date Publication date Assignee Title
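CN114911917B (zh) * 2022-07-13 2023-01-03 树根互联股份有限公司 Asset meta-information search method, apparatus, computer device, and readable storage medium
CN114911917A (zh) * 2022-07-13 2022-08-16 树根互联股份有限公司 Asset meta-information search method, apparatus, computer device, and readable storage medium
CN115168594A (zh) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 Police incident information processing method and apparatus, electronic device, and storage medium
CN115514579A (zh) * 2022-11-09 2022-12-23 北京连星科技有限公司 Method and system for implementing service identification based on IPv6 address-mapped flow labels
CN115496039A (zh) * 2022-11-17 2022-12-20 荣耀终端有限公司 Word extraction method and computer device
CN115827815A (zh) * 2022-11-17 2023-03-21 西安电子科技大学广州研究院 Few-shot learning-based keyword extraction method and apparatus
CN115827815B (zh) * 2022-11-17 2023-12-29 西安电子科技大学广州研究院 Few-shot learning-based keyword extraction method and apparatus
CN115496039B (zh) * 2022-11-17 2023-05-12 荣耀终端有限公司 Word extraction method and computer device
CN116484856A (zh) * 2023-02-15 2023-07-25 北京数美时代科技有限公司 Text keyword extraction method, apparatus, electronic device, and storage medium
CN116484856B (zh) * 2023-02-15 2023-11-17 北京数美时代科技有限公司 Text keyword extraction method, apparatus, electronic device, and storage medium
CN115904855A (zh) * 2023-03-02 2023-04-04 上海合见工业软件集团有限公司 System for determining target driver source code based on dynamic signal tracing
CN115951883A (zh) * 2023-03-15 2023-04-11 日照市德衡信息技术有限公司 Service component management system for distributed microservice architecture and method thereof
CN116127960B (zh) * 2023-04-17 2023-06-23 广东粤港澳大湾区国家纳米科技创新研究院 Information extraction method, apparatus, storage medium, and computer device
CN116127960A (zh) * 2023-04-17 2023-05-16 广东粤港澳大湾区国家纳米科技创新研究院 Information extraction method, apparatus, storage medium, and computer device
CN116580849A (zh) * 2023-05-30 2023-08-11 杭州医初科技有限公司 Medical data collection and analysis system and method thereof
CN116580849B (zh) * 2023-05-30 2024-01-12 华创天成技术有限公司 Medical data collection and analysis system and method thereof
CN116956897B (zh) * 2023-09-20 2023-12-15 湖南财信数字科技有限公司 Hidden advertisement processing method, apparatus, computer device, and storage medium
CN116956897A (zh) * 2023-09-20 2023-10-27 湖南财信数字科技有限公司 Hidden advertisement processing method, apparatus, computer device, and storage medium
CN116976320A (zh) * 2023-09-22 2023-10-31 湖南财信数字科技有限公司 Institution abbreviation extraction method, apparatus, computer device, and storage medium
CN116976320B (zh) * 2023-09-22 2023-12-15 湖南财信数字科技有限公司 Institution abbreviation extraction method, apparatus, computer device, and storage medium
CN116978384A (zh) * 2023-09-25 2023-10-31 成都市青羊大数据有限责任公司 Integrated public security big data management system
CN116978384B (zh) * 2023-09-25 2024-01-02 成都市青羊大数据有限责任公司 Integrated public security big data management system
CN117149957A (zh) * 2023-11-01 2023-12-01 腾讯科技(深圳)有限公司 Text processing method, apparatus, device, and medium
CN117149957B (zh) * 2023-11-01 2024-01-26 腾讯科技(深圳)有限公司 Text processing method, apparatus, device, and medium
CN118035456A (zh) * 2024-04-11 2024-05-14 江西微博科技有限公司 Big data-based electronic material data sharing management system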

Also Published As

Publication number Publication date
CN112632292A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
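WO2022134575A1 (zh) Business keyword extraction method, apparatus, device, and storage medium
JP5346279B2 (ja) Annotation by search
CN113011533A (zh) Text classification method, apparatus, computer device, and storage medium
US9355171B2 (en) Clustering of near-duplicate documents
WO2021051518A1 (zh) Neural network model-based text data classification method, apparatus, and storage medium
CN109582972A (zh) Optical character recognition error correction method based on natural language recognition
WO2022126960A1 (zh) Business clause data processing method, apparatus, device, and storage medium
CN110807324A (zh) Film and television entity recognition method based on IDCNN-crf and knowledge graph
CN113961528A (zh) Knowledge graph-based file semantic association storage system and method
CN110347796A (zh) Short text similarity calculation method in a vector semantic tensor space
US7333997B2 (en) Knowledge discovery method with utility functions and feedback loops
CN114237621B (zh) Semantic code search method based on a fine-grained co-attention mechanism
CN112307364B (zh) Person-representation-oriented method for extracting the place of occurrence from news text
CN116151132A (zh) Intelligent code completion method, system, and storage medium for programming learning scenarios
CN112948601A (zh) Cross-modal hash retrieval method based on controlled semantic embedding
CN114676346A (zh) News event processing method, apparatus, computer device, and storage medium
US11604923B2 (en) High volume message classification and distribution
Patil et al. A comparative study of text embedding models for semantic text similarity in bug reports
CN113571198B (zh) Conversion rate prediction method, apparatus, device, and storage medium
CN114003750B (zh) Material launch method, apparatus, device, and storage medium
CN113010643B (zh) Method, apparatus, device, and storage medium for processing Buddhist-domain vocabulary
Bossard et al. An evolutionary algorithm for automatic summarization
CN116127013A (zh) Personal sensitive information knowledge graph query method and apparatus
CN113434698B (zh) Method for building a relation extraction model based on full-hierarchy attention, and application thereof
CN112668284B (zh) Legal document segmentation method and system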

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21908580

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2023)

122 EP: PCT application non-entry in European phase

Ref document number: 21908580

Country of ref document: EP

Kind code of ref document: A1