CN114090777A

CN114090777A - Text data processing method and device

Info

Publication number: CN114090777A
Application number: CN202111426069.7A
Authority: CN
Inventors: 毛璐; 李长亮
Original assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Current assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2022-02-25

Abstract

The application provides a text data processing method and a text data processing device. The text data processing method comprises: acquiring data to be processed corresponding to a target field and a feature tag corresponding to the data to be processed; constructing a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag; receiving a retrieval request carrying a question to be retrieved, and constructing a retrieval keyword corresponding to the feature information database based on the question to be retrieved; and querying the feature information database according to the retrieval keyword and taking the query result as the response to the retrieval request. A high-quality knowledge base can thereby be constructed quickly, and question answering over complex relationships is supported.

Description

Text data processing method and device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a text data processing method. The application also relates to a text data processing device, a computing device and a computer readable storage medium.

Background

With the development of internet technology, more and more users obtain information about all kinds of data through the internet. In the prior art, to obtain information about a given type of data quickly and in a targeted manner, a knowledge base is usually constructed and searches are performed within its scope. However, retrieval based on such a knowledge base is often limited to exact search and fuzzy matching, and performs poorly on questions involving complex relationships. In addition, the knowledge base is usually constructed and maintained manually, which incurs high labor costs and a certain error rate.

Disclosure of Invention

In view of this, embodiments of the present application provide a text data processing method to solve technical defects in the prior art. The embodiment of the application also provides a text data processing device, a computing device and a computer readable storage medium.

According to a first aspect of embodiments of the present application, there is provided a text data processing method, including:

acquiring data to be processed corresponding to a target field and a feature tag corresponding to the data to be processed;

constructing a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag;

receiving a retrieval request carrying a question to be retrieved, and constructing a retrieval keyword corresponding to the feature information database based on the question to be retrieved;

and querying the feature information database according to the retrieval keyword, and taking the query result as the response to the retrieval request.

Optionally, the obtaining of the feature tag corresponding to the to-be-processed data includes:

parsing the data to be processed to obtain the feature information;

matching the feature information with the reference feature tags contained in a resource information database, and assigning at least one feature tag to the data to be processed according to the matching result;

wherein constructing a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag comprises:

constructing a feature information database associated with the target field based on the feature information and the at least one feature tag corresponding to the feature information.

Optionally, analyzing the data to be processed to obtain the feature information includes:

extracting at least one piece of text information contained in the data to be processed;

performing text classification on the at least one piece of text information to obtain the text type of each piece of text information;

and analyzing the at least one piece of text information according to the text type to obtain the feature information corresponding to the data to be processed.

Optionally, constructing a retrieval keyword corresponding to the feature information database based on the question to be retrieved includes:

performing information extraction on the question to be retrieved to obtain a keyword corresponding to the question to be retrieved;

standardizing the keyword based on an entity linking rule to obtain the retrieval keyword corresponding to the feature information database.

Optionally, standardizing the keyword based on an entity linking rule to obtain the retrieval keyword corresponding to the feature information database includes:

linking the keyword to a plurality of candidate entities in the feature information database;

screening a target candidate entity from the plurality of candidate entities based on a screening rule;

and determining the retrieval keyword according to the target candidate entity.
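The three standardization steps above (link candidates, screen, determine the retrieval keyword) can be illustrated with a minimal sketch. The application does not specify the linking or screening rules, so everything here is an assumption made for illustration: the `ENTITIES` table, its aliases and popularity scores, the use of `difflib.SequenceMatcher` as the similarity measure, and the rule "highest similarity, ties broken by popularity".

```python
from difflib import SequenceMatcher

# Hypothetical entity table: canonical entity name -> aliases and a popularity score.
ENTITIES = {
    "Director A": {"aliases": ["director a", "dir. a"], "popularity": 0.9},
    "Director AB": {"aliases": ["director ab"], "popularity": 0.4},
    "Lead Actor A": {"aliases": ["lead actor a", "actor a"], "popularity": 0.7},
}


def link_candidates(keyword, min_similarity=0.6):
    """Link a keyword to every entity that has a sufficiently similar alias."""
    keyword = keyword.lower().strip()
    candidates = []
    for name, info in ENTITIES.items():
        best = max(SequenceMatcher(None, keyword, alias).ratio()
                   for alias in info["aliases"])
        if best >= min_similarity:
            candidates.append((name, best, info["popularity"]))
    return candidates


def screen(candidates):
    """Assumed screening rule: highest similarity, ties broken by popularity."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[1], c[2]))[0]


def standardize(keyword):
    """Return the canonical entity name to use as the retrieval keyword."""
    return screen(link_candidates(keyword))


print(standardize("Dir. A"))
```

A keyword with no sufficiently similar alias yields `None`, which a caller could treat as "no standardized keyword available".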

Optionally, querying the feature information database according to the retrieval keyword and taking the query result as the response to the retrieval request includes:

generating at least one candidate path corresponding to the retrieval keyword based on the retrieval keyword and the feature information database;

and determining a target entity according to the at least one candidate path, and taking the target entity as the response to the retrieval request.

Optionally, querying the feature information database according to the retrieval keyword includes:

determining a retrieval information keyword according to the retrieval keyword, and querying the feature information database based on the retrieval information keyword; or

determining a retrieval tag keyword according to the retrieval keyword, and querying the feature information database based on the retrieval tag keyword; or

determining a retrieval information keyword and a retrieval tag keyword according to the retrieval keyword, and querying the feature information database based on the retrieval information keyword and the retrieval tag keyword.

Optionally, querying the feature information database according to the retrieval keyword and taking the query result as the response to the retrieval request includes:

querying the feature information database according to the retrieval keyword to obtain at least two pieces of query information;

sorting the at least two pieces of query information based on a preset sorting rule to obtain a query result list;

and taking the query result list as the response to the retrieval request.

Optionally, after the step of constructing the feature information database associated with the target field based on the feature information in the data to be processed and the feature tag is performed, the method further includes:

checking the resource information database at a preset interval;

determining change information of the resource information database when a change in the tag information and/or the data information contained in the resource information database is detected;

updating the feature information database based on the change information.

Optionally, the text data processing method further includes:

receiving new data uploaded by a user;

and processing the new data and storing it in the feature information database.

According to a second aspect of embodiments of the present application, there is provided a text data processing apparatus including:

an acquisition module configured to acquire data to be processed corresponding to a target field and a feature tag corresponding to the data to be processed;

a building module configured to build a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag;

a processing module configured to receive a retrieval request carrying a question to be retrieved and construct a retrieval keyword corresponding to the feature information database based on the question to be retrieved;

and a query module configured to query the feature information database according to the retrieval keyword and take the query result as the response to the retrieval request.

According to a third aspect of embodiments herein, there is provided a computing device comprising:

a memory and a processor;

the memory is used for storing computer-executable instructions, and the processor realizes the steps of the text data processing method when executing the computer-executable instructions.

According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text data processing method.

According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the text data processing method.

According to the text data processing method provided by the application, data to be processed corresponding to a target field and a feature tag corresponding to the data to be processed are acquired; a feature information database associated with the target field is constructed based on the feature information in the data to be processed and the feature tag; a retrieval request carrying a question to be retrieved is received, and a retrieval keyword corresponding to the feature information database is constructed based on the question to be retrieved; and the feature information database is queried according to the retrieval keyword, with the query result taken as the response to the retrieval request. A high-quality knowledge base is thereby constructed quickly. In addition, the data in the knowledge base is updated by expanding the feature tags, so the knowledge base can be updated quickly; question answering over complex relationships is supported; and standardizing the question to be retrieved improves both the accuracy of answer retrieval and the richness of the answers.

Drawings

Fig. 1 is a structural diagram of a text data processing method according to an embodiment of the present application;

Fig. 2 is a flowchart of a text data processing method according to an embodiment of the present application;

Fig. 3 is a flowchart of a text data processing method applied to resume data processing according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a text data processing apparatus according to an embodiment of the present application;

Fig. 5 is a block diagram of a computing device according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.

First, the terms used in one or more embodiments of the present application are explained.

Entity linking (EL): mapping entity mentions in text to the corresponding entities in a given knowledge base (KB).

Knowledge base: a knowledge base gathers the knowledge of a specific field. The knowledge is originally expressed in unstructured natural language; to make it easier for a computer to process and understand, it is formalized and simplified into triples. A triple in the knowledge base has the form (entity, entity relationship, entity).

Named entity recognition (NER): recognizing entities in text that have a particular meaning.

jieba word segmentation: jieba first builds a prefix dictionary from a statistical dictionary. It then uses the prefix dictionary to find all possible segmentations of the input sentence and builds a directed acyclic graph (DAG) from the segmentation positions. Finally, it computes the maximum-probability path through the graph with a dynamic programming algorithm, which yields the final segmentation.
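The prefix dictionary, directed acyclic graph, and maximum-probability path described above can be sketched in a few lines. This is a toy reimplementation of the idea, not jieba's actual code: the `WORD_FREQ` counts are invented, and unspaced Latin text stands in for Chinese so that the example stays readable.

```python
import math

# Toy frequency dictionary (hypothetical counts, not jieba's real dictionary).
WORD_FREQ = {"film": 10, "director": 8, "direct": 5, "or": 3, "a": 2}


def build_prefix_dict(word_freq):
    """Keep every word's count; add each proper prefix with count 0 if it is not a word."""
    prefix_dict = dict(word_freq)
    for word in word_freq:
        for end in range(1, len(word)):
            prefix_dict.setdefault(word[:end], 0)
    return prefix_dict


def build_dag(sentence, prefix_dict):
    """For each start index, list the end indices of dictionary words that begin there."""
    dag = {}
    for start in range(len(sentence)):
        ends, end = [], start
        fragment = sentence[start]
        while end < len(sentence) and fragment in prefix_dict:
            if prefix_dict[fragment] > 0:
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def segment(sentence, word_freq=WORD_FREQ):
    """Pick the maximum log-probability path through the DAG, scanning right to left."""
    prefix_dict = build_prefix_dict(word_freq)
    dag = build_dag(sentence, prefix_dict)
    log_total = math.log(sum(word_freq.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        options = []
        for end in dag[start]:
            count = prefix_dict.get(sentence[start:end + 1]) or 1  # unseen -> count 1
            options.append((math.log(count) - log_total + route[end + 1][0], end))
        route[start] = max(options)
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


print(segment("filmdirector"))  # ['film', 'director']
```

In this toy dictionary "director" wins over "direct" + "or" because a single frequent word has a higher log probability than the product of two less frequent ones.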

TextRank algorithm: a graph-based ranking algorithm for keyword extraction and document summarization, adapted from PageRank, Google's algorithm for ranking the importance of web pages. It uses co-occurrence information (semantics) between words in a document to extract keywords and key phrases from a given text, and it can also extract key sentences of the text by means of extractive automatic summarization.
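As a rough illustration of keyword extraction by co-occurrence ranking, the sketch below builds an undirected co-occurrence graph over a token list and runs the unweighted PageRank-style update on it. The window size, damping factor, iteration count, and the sample token list are illustrative assumptions; a real pipeline would also tokenize the text and filter by part of speech first.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a TextRank-style update over a co-occurrence graph."""
    # Link words that occur within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the score update; each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = ["film", "director", "film", "actor", "film", "award"]
print(textrank_keywords(tokens, window=1, top_k=1))  # ['film']
```

Here "film" co-occurs with every other token, so it accumulates the highest score.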

Target field: a specific scope within which all things together constitute a specific field, for example all film and television resources in the film and television field or all musical works in the music field; job hunting, reading, medical treatment and the like are also such fields.

Feature tag: an identifiable, symbolic word or phrase that represents a feature of a thing.

Feature information database: a knowledge base formed by analyzing the data resources within a specific scope and combining the analysis results.

Standardization: the activity of establishing common, reusable rules for actual or potential problems in order to achieve optimal order within a given scope.

The present application provides a text data processing method, and also relates to a text data processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

As shown in the structural diagram of the text data processing method in Fig. 1, when a user inputs a data resource, a data resource parsing system parses it to obtain the feature information corresponding to the data resource. The feature tag of the data resource is then determined from the feature information together with the resource information database, and the feature information database is created from the feature tag and the feature information. When a user inputs a question to be retrieved, keywords are extracted from the question and standardized according to the feature information and feature tags in the feature information database; the feature information database is then searched with the standardized retrieval keywords, and candidate answers to the question are output.

In this embodiment, the data resources input by the user are parsed, the feature information in them is extracted, the feature tags of the data resources are determined in combination with the resource information database, and the feature information database is created from the feature tags and the feature information. When a question to be retrieved is received from the user, keywords are extracted from it and standardized, and the feature information database is searched with the standardized keywords to obtain candidate answers. A high-quality feature information database is thus constructed quickly, and question answering over complex relationships is supported.

Fig. 2 is a flowchart illustrating a text data processing method according to an embodiment of the present application, which specifically includes the following steps:

Step S202, acquiring data to be processed corresponding to a target field and a feature tag corresponding to the data to be processed.

Specifically, the target field is a specific scope, and all things within that scope constitute a specific field, for example all film and television resources in the film and television field or all musical works in the music field; job hunting, reading, medical treatment and the like are also such fields. The data to be processed refers to the data resources that correspond to the target field and require further analysis, such as information about any film or television work in the film and television field or about any musical work in the music field. The data resources may be of multiple types, including but not limited to pictures and .doc, .pdf, .html and .txt files. The data processing procedure for any field may refer to the corresponding description in this embodiment. A feature tag is an identifiable symbol or mark that is assigned to at least one of the pieces of text information obtained by processing the data to be processed, and that corresponds to the feature of that text information. For example, for any film or television work in the film and television field, a feature tag is assigned to each of its features, such as showing time and type, so that the work can be identified by its feature tags.

Based on this, once the target field is determined, the data resources of the various types corresponding to it are parsed to obtain the pieces of text information corresponding to each data resource, and a feature tag is assigned to at least one piece of text information according to its features in order to identify it. The feature tag may be predetermined, may be produced by a feature tag model, or may be assigned to the text information by manual annotation.

In this embodiment, the text data processing method is described with the film and television field as the target field. After the film and television resources uploaded by the user are acquired, the resources of each type are processed and their text information is extracted. Feature tags are then assigned to the resulting text data according to its features, for example the showing time, type, film length and awards obtained of a film and television resource. When a feature tag is assigned to the showing time, equal time intervals are predetermined as feature tags; with an interval of ten years, the feature tags marking the showing time are 1990-2000, 2000-2010, and so on. The feature tag for the type can be determined by a feature tag model, which recognizes words denoting the type of a film and television resource as feature tags; the type feature tags include drama, history, suspense, science fiction, war and the like. Because not every film and television resource has won an award, the award feature tag is not assigned to every resource. The same approach can be used in this embodiment to process resources and assign feature tags in any field.
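The tag assignment in this example can be sketched as follows. The function names, the `GENRE_TAGS` vocabulary (a stand-in for whatever a feature tag model would recognize), and the "award-winning" tag name are illustrative assumptions. The source's intervals share their boundary years (1990-2000, 2000-2010), so this sketch additionally assumes that a boundary year belongs to the later interval.

```python
# Illustrative type vocabulary; the source obtains type tags from a feature tag model.
GENRE_TAGS = {"drama", "history", "suspense", "science fiction", "war"}


def decade_tag(year, interval=10):
    """Map a showing year to its interval tag, e.g. 1995 -> '1990-2000'."""
    start = year - year % interval
    return f"{start}-{start + interval}"


def assign_feature_tags(showing_year, genres, awards=()):
    """Collect feature tags; the award tag is added only when awards exist."""
    tags = [decade_tag(showing_year)]
    tags += [genre for genre in genres if genre in GENRE_TAGS]
    if awards:
        tags.append("award-winning")
    return tags


print(assign_feature_tags(2019, ["history", "suspense"], ["Best Film"]))
print(assign_feature_tags(1995, ["war"]))
```

A resource without awards simply receives no award tag, matching the observation that not every resource is given one.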

Further, after the data to be processed in the target field has been processed to obtain the text information corresponding to each item of data and a feature tag has been assigned to at least one piece of text information, a film and television resource library needs to be created from the text information and the feature tags so that users can retrieve film and television resources in a targeted manner. A user who wants to watch film and television resources within a specific scope can then retrieve them quickly. This is implemented as follows:

Step S204, constructing a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag.

Specifically, the feature information consists of the pieces of text information obtained by parsing the data to be processed. Each piece of text information reflects one feature of the data to be processed, so one piece of text information is one piece of feature information. For example, when the data to be processed is film and television resource data, text information such as the showing time, type, film length, director and actors of the resource is all feature information. The feature information database is a knowledge base composed of the feature information obtained by parsing the data to be processed and the feature tags corresponding to that data. Feature information and feature tags that belong to the same data to be processed are associated with each other. The feature information database is a relational database; when it is constructed, data writing rules and data query rules are configured for it, and subsequent writes, modifications and queries follow the configured rules.

Based on this, after the text information that reflects the features of the data to be processed is obtained and a corresponding tag is assigned to each piece of text information according to its features, the text information and the corresponding tags are combined to construct a database for the target field, namely the feature information database of the target field.

Following the above example, the feature information of the film and television resources obtained above includes the showing time, type, film length and awards obtained, and feature tags have been assigned to this feature information; for the showing time, for instance, the corresponding feature tags are 1990-2000, 2000-2010 and so on. A film and television resource database is created by combining this feature information with the feature tags. All the feature information of one film and television resource, together with the corresponding feature tags, forms one record in the resource database, and the records of many film and television resources form an extensible film and television resource database.
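One way to picture such a relational feature information database is the SQLite sketch below. The table names, column names and sample records are hypothetical, chosen only to show feature information and feature tags being stored per resource and then queried by tag; the application does not disclose a schema.

```python
import sqlite3


def build_feature_db(records):
    """Create an in-memory database holding feature information and feature tags."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE feature_info (
            resource_id INTEGER NOT NULL REFERENCES resource(id),
            category TEXT NOT NULL,
            value TEXT NOT NULL
        );
        CREATE TABLE feature_tag (
            resource_id INTEGER NOT NULL REFERENCES resource(id),
            tag TEXT NOT NULL
        );
    """)
    for name, features, tags in records:
        cur = conn.execute("INSERT INTO resource (name) VALUES (?)", (name,))
        rid = cur.lastrowid
        conn.executemany(
            "INSERT INTO feature_info VALUES (?, ?, ?)",
            [(rid, category, value) for category, value in features],
        )
        conn.executemany(
            "INSERT INTO feature_tag VALUES (?, ?)", [(rid, tag) for tag in tags]
        )
    conn.commit()
    return conn


def resources_with_tag(conn, tag):
    """Return the names of all resources carrying the given feature tag."""
    rows = conn.execute(
        "SELECT r.name FROM resource r JOIN feature_tag t ON t.resource_id = r.id "
        "WHERE t.tag = ? ORDER BY r.name",
        (tag,),
    )
    return [name for (name,) in rows]


conn = build_feature_db([
    ("Film A", [("director", "Director A"), ("showing time", "2019")],
     ["2010-2020", "history"]),
    ("Film B", [("director", "Director B"), ("showing time", "1995")],
     ["1990-2000", "history"]),
])
print(resources_with_tag(conn, "history"))  # ['Film A', 'Film B']
```

Adding another resource is just another set of inserts, which is what makes the database extensible in the sense described above.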

Further, because the data to be processed comes in many types and different types call for different processing methods, a parsing method must be selected according to the type of the data to be processed in order to obtain the corresponding feature information. This is implemented as follows:

Step S204-2, parsing the data to be processed to obtain the feature information.

Specifically, parsing refers to selecting a data processing method according to the type of the data to be processed and analyzing the data to obtain the corresponding feature information. The data to be processed may be of a picture type, or of a .doc, .pdf, .html or .txt type, and the like. For a picture-type data resource, the text in the picture is extracted, for example by OCR (optical character recognition), which essentially detects regions that may contain text, recognizes those regions, and extracts the recognized text content. The text can also be detected and extracted with the MSER (Maximally Stable Extremal Regions) algorithm built into OpenCV. For .pdf data, the text in the document can likewise be extracted by OCR. For .doc data, the text content can be edited directly.
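Selecting a processing method by data type can be expressed as a simple dispatch table. The sketch below is only an assumption about how such a dispatcher might look: the handler names and the `PARSERS` mapping are hypothetical, the OCR branch is a placeholder where an OCR engine would be plugged in, and only the .txt and .html handlers do real work, using the standard library.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_txt(raw: bytes) -> str:
    return raw.decode("utf-8")


def parse_html(raw: bytes) -> str:
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8"))
    collector.close()
    return " ".join(collector.chunks)


def parse_with_ocr(raw: bytes) -> str:
    # Placeholder: picture and .pdf data would go through an OCR engine here.
    raise NotImplementedError("plug in an OCR engine")


# Hypothetical mapping from file suffix to processing method.
PARSERS = {
    ".txt": parse_txt,
    ".html": parse_html,
    ".pdf": parse_with_ocr,
    ".png": parse_with_ocr,
    ".jpg": parse_with_ocr,
}


def extract_text(filename: str, raw: bytes) -> str:
    """Choose the processing method from the file suffix and extract the text."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported data type: {suffix!r}") from None
    return parser(raw)


print(extract_text("film_a.html", b"<p>Director: <b>Director A</b></p>"))
```

New data types can be supported by registering another handler in the mapping without touching `extract_text`.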

Further, since different types of data to be processed must be handled by different methods, the feature information is obtained from the data to be processed as follows:

Step S204-2-2, extracting at least one piece of text information contained in the data to be processed;

Step S204-2-4, performing text classification on the at least one piece of text information to obtain the text type of each piece of text information;

Step S204-2-6, analyzing the at least one piece of text information according to the text type to obtain the feature information corresponding to the data to be processed.

Specifically, the text information refers to the text content obtained by performing character extraction and data cleaning on the data to be processed. The text type refers to the categories of text content within the text information, each of which represents a different attribute. For example, when the data to be processed is film and television data, the showing time, type and the like of the film and television work are text types; when the data to be processed is a resume, the personal information, educational background, work experience and the like are text types. The text content under one text type corresponds to at least one piece of feature information of the data to be processed.

Based on this, because the data to be processed comes in various types, character extraction is performed on each type and the data is cleaned to obtain the corresponding text content. This text content contains everything related to the data to be processed, so it must be classified further in order to extract the feature information from each category more accurately. Finally, the representative feature information of the data to be processed is obtained by analyzing the classified text content.

Following the above example, after the film and television resources uploaded by the user are acquired, a suitable data processing method is selected for each type of resource, and character extraction and data cleaning are performed to obtain the text content related to the resources. For film and television resource A, the text content obtained after character extraction and data cleaning is as follows: director: director a; screenwriter: screenwriter 1, screenwriter 2; lead actor: lead actor a, lead actor b, lead actor c; type: history, suspense; language: Chinese; showing time: 2019; film length: 135 minutes; awards obtained: Best Film. Here director, screenwriter, lead actor, type, language, showing time, film length and awards obtained are the categories corresponding to resource A, and the feature information is extracted from each category: director a for director, screenwriter 1 and screenwriter 2 for screenwriter, lead actor a, lead actor b and lead actor c for lead actor, and so on. The remaining categories are handled in the same way and are not described again here.
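The per-category extraction in this example can be sketched as below. The sketch assumes that the cleaned text has one "category: value, value" entry per line, which the source does not state; the function name is hypothetical as well.

```python
def extract_feature_info(cleaned_text):
    """Split cleaned 'category: value, value' lines into per-category feature lists."""
    features = {}
    for line in cleaned_text.splitlines():
        category, sep, values = line.partition(":")
        if not sep:
            continue  # not a 'category: values' line
        items = [value.strip() for value in values.split(",") if value.strip()]
        if items:
            features[category.strip().lower()] = items
    return features


cleaned = """\
director: director a
lead actor: lead actor a, lead actor b, lead actor c
type: history, suspense
language: Chinese
"""
features = extract_feature_info(cleaned)
print(features["lead actor"])  # ['lead actor a', 'lead actor b', 'lead actor c']
print(features["type"])        # ['history', 'suspense']
```

Each dictionary key plays the role of a text category, and each list holds the pieces of feature information extracted from that category.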

In summary, text classification is performed on the film and television resources and the feature information is then extracted from the classified text, which yields the feature information corresponding to the resources and improves the accuracy of feature information extraction.

In addition, because the resource information database already contains reference feature tags, the feature information in the data to be processed can be matched directly against those reference feature tags when the feature information database is constructed. This determines the relationship between the feature information and the reference feature tags, and the feature information database is constructed on the basis of that relationship, which improves the construction result. In this embodiment, the specific implementation is shown in steps S204-4 to S204-6 below.

Step S204-4, matching the feature information with the reference feature tags contained in a resource information database, and assigning at least one feature tag to the data to be processed according to the matching result.

Step S204-6, constructing a feature information database associated with the target field based on the feature information and the at least one feature tag corresponding to the feature information.

Specifically, the resource information database refers to a knowledge base that stores data resources related to the feature information of the data to be processed. In the film and television field, examples are a knowledge base of award-winning films over the years, which contains award names and information about the award-winning works, and a knowledge base of directors of film and television works, which contains director rankings, names of works and other related information. A reference feature tag is a symbolic word stored in the resource information database that represents a feature of the information stored there. For example, in the knowledge base of directors, all directors are ranked by information such as audience ratings, and the top ten directors in the ranking are given the symbolic word "famous director", which is a reference feature tag.

Based on this, after the feature information of the data to be processed is extracted, and so that the extracted feature information better reflects the characteristics of the data to be processed, a knowledge base composed of data resources related to the data to be processed is analyzed, and the identifying words that represent the data stored in that knowledge base are taken as reference feature tags. The feature information acquired from the data to be processed is then matched with the reference feature tags. When the matching condition is met, that is, when the data corresponding to a reference feature tag in the knowledge base has the same features as the feature information in the data to be processed, or the feature similarity is higher than a preset threshold, the reference feature tag is allocated to the feature information as its feature tag. The matching can be performed in either of two ways: the feature information in the data to be processed can be mapped to the reference feature tags in the resource information database, and the reference feature tag matching the feature information is determined according to a preset similarity value; or the feature information can be matched with the reference feature tags by a semantic analysis method, to obtain the reference feature tags whose matching degree with the feature information is higher than a preset matching degree threshold.

It should be noted that a piece of feature information obtained from the data to be processed may have one or more feature tags, or may have none, which is not limited in this embodiment. After the feature information and the feature tags of the data to be processed are determined, the feature information database can be created from the data to be processed, its corresponding feature information, and the feature tags.
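The matching of steps S204-4 and S204-6 can be sketched as follows. This is only a minimal illustration under stated assumptions: the knowledge base contents, the function names, the threshold of 0.95, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, since the embodiment does not prescribe a particular similarity or semantic analysis method.

```python
from difflib import SequenceMatcher

# Hypothetical resource information database (a director knowledge base):
# entry name -> reference feature tags. All names and tags are illustrative.
DIRECTOR_KB = {
    "director a": ["famous director", "best director"],
    "director b": ["new director"],
}


def similarity(left: str, right: str) -> float:
    """Stand-in similarity measure; the embodiment does not prescribe one."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def assign_feature_tags(feature_info: str, resource_db: dict, threshold: float = 0.95) -> list:
    """Collect the reference tags of entries whose similarity exceeds the threshold."""
    tags = []
    for entry_name, reference_tags in resource_db.items():
        if similarity(feature_info, entry_name) > threshold:
            tags.extend(reference_tags)
    return tags


def build_feature_database(resources: dict, resource_db: dict) -> dict:
    """Map each resource to its feature information and the tags assigned to it."""
    return {
        resource: {info: assign_feature_tags(info, resource_db) for info in infos}
        for resource, infos in resources.items()
    }


# "screenwriter 1" has no entry in the director knowledge base, so it gets no tag.
movie_library = build_feature_database(
    {"movie resource A": ["Director A", "screenwriter 1"]}, DIRECTOR_KB
)
```

In this sketch a piece of feature information that matches no entry simply receives an empty tag list, which corresponds to the case where no feature tag exists for that feature information.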

Following the above example, after the feature extraction operation is performed on the movie resources, feature information in a plurality of categories corresponding to movie resource A is obtained, namely director a, screenwriter 1, screenwriter 2, actor A, actor B, actor C, and so on. To further highlight the characteristics of movie resource A, feature tags are allocated to its feature information by retrieving the corresponding reference feature tags from the databases related to that feature information. For example, if the director of movie resource A is director a, director a is looked up in the database storing information about all directors, and the tag information of director a in that database, such as "best director" and the director's ranking, is used as the feature tags of director a in movie resource A. This completes the determination of the feature tags of movie resource A. The same operations, namely text extraction, data cleaning, determining feature information, determining feature tags, and allocating the feature tags to the feature information, are performed on the N movie resources uploaded by the user, such as movie resource B, movie resource C, and movie resource D. A movie resource library containing the N movie resources, including movie resource A and movie resource B, is then created from the N movie resources, their feature information in categories such as director and screenwriter, and the corresponding feature tags, where N is a positive integer greater than 1.
It should be noted that the processing method of this embodiment can be used as a reference when processing any kind of resource to construct a knowledge base.

In summary, the feature information of the movie resources is analyzed, and the databases related to the feature information are searched to determine the feature tags of the movie resources. The movie resource library is thereby created, a high-quality knowledge base is obtained, and the accuracy of the information in the knowledge base is improved.

After the creation of the feature information database is completed, since the resource information database is continuously updated along with the change of the information data, in order to ensure the accuracy and the real-time performance of the data information in the created feature information database as much as possible, the resource information database needs to be periodically detected, and then the feature information database is updated based on the detection result, which is specifically realized as follows:

Step S204-8, checking the resource information database at a preset interval;

Step S204-10, determining the change information of the resource information database when a change in the tag information and/or the data information contained in the resource information database is detected;

Step S204-12, updating the feature information database based on the change information.

Specifically, the tag information and the data information refer to the resource information corresponding to the data resources stored in the resource information database: the data information refers to the text data stored in the resource information database, and the tag information refers to the identifying words that represent the characteristics of that text data. The change information refers to the actual data that results when data resources change due to time factors, social factors, and the like. For example, the release of a new film or television work, or the announcement of award-winning works at an awards ceremony, affects the information in the knowledge bases related to movie resources, and the changed data in those knowledge bases then needs to be adjusted and updated.

Based on this, after the feature information database is created, because the resource information database is not constant but is continuously updated as the information changes, a fixed time interval can be selected to periodically check all resource information databases that were referred to when the feature information database was created, so as to keep the data in the feature information database as accurate and up to date as possible. When a change is detected in the text data in a resource information database or in the identifying words representing the characteristics of that text data, the feature information database is searched based on the changed information and the corresponding data in the feature information database is updated. To make the changes detectable, a copy of the resource information database is saved when the feature information is matched with the reference feature tags it contains; at fixed intervals, the current resource information database is compared with the saved copy to determine which data has changed.
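The comparison between the saved copy and the current resource information database (steps S204-10 and S204-12) can be sketched as below. The data layout, the function names `diff_snapshots` and `apply_changes`, and the sample entries are assumptions made for illustration; scheduling the check at the preset interval and handling deleted entries are deliberately left out.

```python
def diff_snapshots(saved: dict, current: dict) -> dict:
    """Return the entries that are new or whose tags differ from the saved copy."""
    return {name: tags for name, tags in current.items() if saved.get(name) != tags}


def apply_changes(feature_db: dict, changes: dict) -> None:
    """Update the feature tags of changed entries present in the feature database."""
    for name, tags in changes.items():
        if name in feature_db:
            feature_db[name]["tags"] = list(tags)


# Hypothetical data: the copy saved at matching time, and the knowledge base
# as found at the next periodic check, after an awards ceremony.
saved_snapshot = {"movie resource B": [], "movie resource C": []}
current_kb = {"movie resource B": ["best movie"], "movie resource C": []}
movie_library = {
    "movie resource B": {"tags": []},
    "movie resource C": {"tags": []},
}

changes = diff_snapshots(saved_snapshot, current_kb)
apply_changes(movie_library, changes)
```

Only the entry whose tags changed is touched, so the cost of an update is proportional to the amount of change rather than to the size of the feature information database.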

Following the above example, after the movie resource library is created, it also needs to be maintained periodically. The data in the knowledge bases referred to when the movie resource library was created changes due to time factors and social factors, and the data in the movie resource library then needs to be updated accordingly. Changes in the knowledge bases include the release of new film and television works, the announcement of award-winning works at awards ceremonies, and nominations and awards for actors, screenwriters, and directors. These social activities all affect the data in the knowledge bases related to movie resources; the changed data in the related knowledge bases needs to be adjusted and updated, and the newly generated data is stored in the related knowledge bases.

When a change in the data of a related knowledge base is detected, the corresponding data in the movie resource library is adjusted based on the changed data, so that the movie resource library is updated and maintained regularly. For example, an awards ceremony was recently held at which movie resource B was awarded the title of "best movie", so the information about movie resource B in the knowledge base related to the movie resource library changed. Because the knowledge base is checked regularly, when the change in the tag information about movie resource B is detected, the feature tags of movie resource B in the movie resource library are updated based on the updated content, that is, the feature tag "best movie" is added to movie resource B. The movie resource library is maintained and updated in this way. It should be noted that the checking period may be set to one month, two months, or the like, and this embodiment is not limited thereto.

In conclusion, the maintenance and the updating of the movie resource library are completed by periodically detecting the data information stored in the knowledge base related to the movie resource library, so that the accuracy of the resource information in the movie resource library is ensured, the updating operation of the movie resource library is simplified, and the updating efficiency of the movie resource library is improved.

Step S206, receiving a retrieval request carrying a problem to be retrieved, and constructing a retrieval keyword corresponding to the characteristic information database based on the problem to be retrieved.

Specifically, the problem to be retrieved refers to a question that is input by a user and needs to be answered or explained through retrieval by the computer. The retrieval request refers to the computer instruction, issued for the problem to be retrieved, that awaits a response. For example, when a user searches for a movie on a video web page, the user enters the name of the movie or related information in an input box and clicks the search control, whereupon the computer issues a search instruction. The retrieval keyword refers to a keyword obtained by extracting keywords from the problem to be retrieved and converting them so that they are consistent with the expression form of the feature information in the feature information database.

Based on this, after the feature information database is constructed, complex questions can be retrieved from the database in a targeted manner. The computer receives a search instruction carrying the problem to be retrieved and then processes that problem: the keywords in the problem are extracted, the extracted keywords are matched with the data stored in the resource information database, and the keywords are standardized into an expression form that satisfies the query rules of the resource information database.

Following the above example, after the movie resource library is created, questions raised by the user can be retrieved in the movie resource library to obtain their answers. When the computer receives a retrieval request carrying a question, the question is standardized. Suppose the problem to be retrieved raised by the user is "award-winning movies released in 2004". The keywords of the question, namely "2004" and "award-winning", are extracted and treated as entities, and an entity link rule is used to link them to the feature entities in the movie resource library and determine the candidate entities. The candidate entity linked to "2004" is "2004", which is taken as a candidate keyword; the candidate entities linked to "award-winning" are "best movie" and "best visual effect", which are taken as candidate keywords. The one or more candidate keywords corresponding to the two keywords "2004" and "award-winning" are determined as the standardized result, that is, "2004", "best movie", and "best visual effect".

Further, after the problem to be retrieved is determined, because not every word in it needs to be retrieved, the keywords in the problem to be retrieved can be extracted and, to facilitate retrieval in the feature information database, standardized into the expression form used in the feature information database, which is specifically realized as follows:

Step S206-2, performing information extraction on the problem to be retrieved to obtain the keywords corresponding to the problem to be retrieved;

Step S206-4, standardizing the keywords based on an entity link rule to obtain the retrieval keywords corresponding to the feature information database.

Specifically, information extraction refers to extracting the text information with specific referential meaning contained in a text and integrating it in a unified form, for example extracting the keywords from a passage of text; the entity link rule refers to the method of mapping entities in the text to entities in a given knowledge base; standardization refers to processing the results of information extraction in the same way so that they have a consistent expression format.

Based on the above, once the problem to be retrieved is obtained, because such a problem is generally not expressed in a standard form and the relationships in it are relatively complex, its text needs to be further analyzed to extract the keywords, so that retrieval in the feature information database can be performed based on those keywords. Under the specific entity link rule, each keyword is taken as an entity, a plurality of paths between the keyword entity and the feature information entities in the feature information database are established in advance, and a path on which the similarity between the keyword entity and the feature information entity is higher than a preset threshold is selected from the plurality of paths; the feature information entity corresponding to that path is the retrieval keyword corresponding to the keyword entity.
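The information extraction of step S206-2 can be illustrated with a deliberately simple stand-in: a greedy longest-match scan against a known vocabulary. The vocabulary, the sample question, and the function name `extract_keywords` are assumptions for illustration; a practical system would more likely rely on a word segmenter or a keyword extraction algorithm.

```python
# Hypothetical vocabulary of phrases that are meaningful for retrieval.
VOCABULARY = {"first-level actor", "actor", "director a", "2003"}

QUESTION = (
    "Which movies released in 2003 star a first-level actor "
    "and are directed by director a?"
)


def extract_keywords(question: str, vocabulary: set) -> list:
    """Scan left to right, taking the longest vocabulary phrase at each position."""
    text = question.lower()
    phrases = sorted(vocabulary, key=len, reverse=True)  # longest phrases first
    keywords = []
    position = 0
    while position < len(text):
        for phrase in phrases:
            if text.startswith(phrase, position):
                keywords.append(phrase)
                position += len(phrase)
                break
        else:
            # No phrase starts here; move on by one character.
            position += 1
    return keywords


keywords = extract_keywords(QUESTION, VOCABULARY)
```

Trying the longest phrases first means "first-level actor" is extracted as a whole rather than as the shorter phrase "actor". The scan matches raw substrings and ignores word boundaries, which is acceptable only for a sketch.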

In conclusion, the accuracy of searching the problems with complex relationships is improved by performing information extraction and standardization operation on the problems with complex relationships proposed by the user.

Further, after the keywords corresponding to the problem to be retrieved are determined, because they were obtained by information extraction from the problem to be retrieved, the standardized keyword corresponding to each keyword still needs to be determined. This includes determining, in the feature information database, a plurality of candidate entities that may correspond to the keyword, and then screening the candidate entities to obtain a target candidate entity, thereby determining the retrieval keyword corresponding to the keyword, which is specifically realized as follows:

Step S206-4-2, linking the keywords to a plurality of corresponding candidate entities in the feature information database;

Step S206-4-4, screening target candidate entities from the candidate entities based on a screening rule;

Step S206-4-6, determining the retrieval keywords according to the target candidate entities.

Specifically, the candidate entities refer to a set of objects available for selection, each object in the set being an entity; the target candidate entities refer to the set of entities selected from the candidate entity set, each object in that set being a target candidate entity.

Based on the above, after the keywords in the problem to be retrieved are determined, they are linked, using an entity link rule, to the feature information and feature tags stored in the feature information database, and the feature information entities and feature tag entities to which a keyword is linked are taken as the candidate entities corresponding to that keyword. When a keyword has only one candidate entity, that candidate entity is the target candidate entity corresponding to the keyword. When a keyword has a plurality of candidate entities, the similarity between the keyword and each candidate entity is calculated, and the candidate entity whose similarity is higher than a preset threshold is selected as the target candidate entity. In either case, the target candidate entity is the retrieval keyword obtained after the keyword is standardized.
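Steps S206-4-2 to S206-4-6 can be sketched as a two-stage function pair: first link a keyword to candidate entities, then screen the candidates. The shared-word linking rule, the 0.8 threshold, the `difflib.SequenceMatcher` similarity, and all names below are assumptions chosen for illustration and are not taken from the embodiment.

```python
from difflib import SequenceMatcher

# Hypothetical feature information and feature tag entities in the database.
ENTITIES = ["first-level actor", "second-level actor", "director a", "2000-2010"]


def link_candidates(keyword: str, entities: list) -> list:
    """Candidate entities are those sharing at least one word with the keyword."""
    words = set(keyword.lower().split())
    return [entity for entity in entities if words & set(entity.lower().split())]


def select_targets(keyword: str, candidates: list, threshold: float = 0.8) -> list:
    """A single candidate is accepted as is; several are screened by similarity."""
    if len(candidates) == 1:
        return list(candidates)
    return [
        candidate
        for candidate in candidates
        if SequenceMatcher(None, keyword.lower(), candidate.lower()).ratio() > threshold
    ]


actor_candidates = link_candidates("first-level actor", ENTITIES)
actor_targets = select_targets("first-level actor", actor_candidates)
director_targets = select_targets("director a", link_candidates("director a", ENTITIES))
```

Here "second-level actor" is linked as a candidate because it shares the word "actor", but it is screened out because its similarity to the keyword falls below the threshold.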

Following the above example, when a question raised by the user is obtained, an information extraction operation is first performed on it. Suppose the problem to be retrieved is "Which movies released in 2003 star a first-level actor and are directed by director a?". The question is segmented with Jieba to extract its keywords, which are "first-level actor", "director a", and "2003". The three keywords are each linked to the movie resource library. The candidate entities linked to the keyword "first-level actor" may be entities containing "level" and "actor", such as "first-level actor" and "second-level actor"; the candidate entity corresponding to the keyword is determined among them, that is, "first-level actor" is taken as the standardized keyword corresponding to the keyword "first-level actor". The keywords "director a" and "2003" are standardized in the same way, giving the standardized keywords "director a" and "2000-2010" respectively. It should be noted that the information extraction method includes, but is not limited to, Jieba word segmentation; an algorithm such as TextRank keyword extraction may also be used.

In summary, the keywords are standardized based on the entity link rule and the feature information and feature tags in the movie resource library, so that before retrieval the keywords are converted into the form in which they are stored in the movie resource library, which improves retrieval accuracy.

Step S208, querying the feature information database according to the retrieval keywords, and taking the query result as the response to the retrieval request.

Specifically, the response to the retrieval request refers to the result returned after the received retrieval request is analyzed and processed. For example, when a user searches for a movie on a video web page, the user enters the movie in an input box and clicks search, whereupon the computer issues a search instruction; as feedback to that search instruction, the computer displays all the movie information found on the web page and returns it to the user, that is, it responds to the search instruction.

Based on the method, when a retrieval request is received, keywords are obtained by extracting information of the problem to be retrieved, then the keywords are standardized to obtain retrieval keywords corresponding to the keywords, then the characteristic information database can be retrieved based on the obtained retrieval keywords, and the retrieved data is fed back to the user, so that the response to the retrieval request is completed.

Following the above example, when the user enters the problem to be retrieved, "Which movies released in 2003 star a first-level actor and are directed by director a?", and clicks the search control, the computer receives the retrieval request and performs information extraction and standardization on the question carried by the request, obtaining the retrieval keywords "first-level actor", "director a", and "2003". The movie resource library is searched based on these three retrieval keywords for movie resources that satisfy the corresponding conditions, and all the movie resources found are fed back to the user.

Further, after the retrieval keywords corresponding to the problem to be retrieved are determined, the feature data tags corresponding to the retrieval keywords in the feature information database can be determined, and the feature information database can be searched using the retrieval keywords and their corresponding feature data tags, which is specifically realized as follows:

Step S208-2, determining a retrieval information keyword according to the retrieval keyword, and querying the feature information database based on the retrieval information keyword; or determining a retrieval tag keyword according to the retrieval keyword, and querying the feature information database based on the retrieval tag keyword; or determining both a retrieval information keyword and a retrieval tag keyword according to the retrieval keywords, and querying the feature information database based on both.

Specifically, the retrieval information keyword is a keyword corresponding to the feature information; the retrieval tag keyword refers to tag information representing a characteristic of the retrieval keyword. For example, if the retrieval keyword "actor a" is a first-level actor, then "first-level actor" represents a characteristic of "actor a", that is, the feature data tag of "actor a" is "first-level actor".

Based on this, after the retrieval keyword is determined, because the feature information database stores both feature information and the feature tags corresponding to it, the retrieval can be made more accurate by resolving the retrieval keyword before the database is searched. One option is to determine the retrieval information keyword corresponding to the retrieval keyword and search the feature information database according to it. Another is to determine the retrieval tag keyword corresponding to the retrieval keyword and search the feature information database according to it. A third is to determine both the retrieval information keyword and the retrieval tag keyword corresponding to the retrieval keywords and search the feature information database according to both, obtaining the retrieval result corresponding to the retrieval keywords.
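The three query modes can be sketched with a single function that accepts retrieval information keywords, retrieval tag keywords, or both. The record layout, the sample actor records, and the function name `query` are hypothetical and serve only to make the distinction between the two kinds of keyword concrete.

```python
# Hypothetical records in the feature information database: each record has
# feature information ("info") and feature tags ("tags").
ACTOR_RECORDS = [
    {"name": "actor 1", "info": {"male"}, "tags": {"first-level actor"}},
    {"name": "actor 2", "info": {"female"}, "tags": {"first-level actor"}},
    {"name": "actor 3", "info": {"male"}, "tags": set()},
]


def query(records: list, info_keywords=(), tag_keywords=()) -> list:
    """Return names of records containing every given info and tag keyword."""
    info_needed = set(info_keywords)
    tags_needed = set(tag_keywords)
    return [
        record["name"]
        for record in records
        if info_needed <= record["info"] and tags_needed <= record["tags"]
    ]


by_info = query(ACTOR_RECORDS, info_keywords=["male"])
by_tag = query(ACTOR_RECORDS, tag_keywords=["first-level actor"])
by_both = query(ACTOR_RECORDS, info_keywords=["male"], tag_keywords=["first-level actor"])
```

Querying by both kinds of keyword returns the records that satisfy the tag condition and the information condition at the same time, which is the same result as filtering by tag first and by information afterwards.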

For example, keyword extraction is performed on different retrieval questions, and the retrieval keyword corresponding to the question raised by the user is determined. When the retrieval keyword corresponds to a retrieval information keyword, for instance when the determined retrieval keyword is "actor" and its corresponding retrieval information keyword is "actor grandfather", the movie resource library is searched based on the keyword "actor grandfather", and all records related to "actor grandfather" in the movie resource library are determined.

When the retrieval keyword corresponds to a retrieval tag keyword, for instance when the determined retrieval keyword is "famous actor" and its corresponding retrieval tag keyword is "first-level actor", the movie resource library is searched based on the keyword "first-level actor", and all records related to "first-level actor" in the movie resource library are determined.

When the retrieval keywords correspond to both a retrieval information keyword and a retrieval tag keyword, for instance when the determined retrieval keywords are "famous actor" and "male", the retrieval tag keyword corresponding to "famous actor" is "first-level actor" and the retrieval information keyword corresponding to "male" is "male". The movie resource library is searched based on the two keywords "first-level actor" and "male": all records related to "first-level actor" are determined first, and then all records whose gender is male are determined among them.

In summary, the retrieval information keywords and the retrieval tag keywords corresponding to the retrieval keywords are determined based on the feature information and the feature tags stored in the feature information database, so that the feature information database is retrieved based on the retrieval information keywords and the retrieval tag keywords, and the retrieval accuracy is improved.

After the search keyword is determined, when searching is carried out in the characteristic information database based on the search keyword, a candidate path between the search keyword and the characteristic information is generated firstly, and then a target entity is determined according to the candidate path, which is specifically realized as follows:

Step S208-4, generating at least one candidate path corresponding to the retrieval keyword based on the retrieval keyword and the feature information database;

Step S208-6, determining a target entity according to the at least one candidate path, and using the target entity as the response to the retrieval request.

Specifically, the candidate path refers to an optional path between entities obtained by linking the entities using an entity link rule; the target entity is the entity, determined to correspond to the retrieval keyword, that is selected from the one or more candidate entities corresponding to the candidate paths.

Based on this, after the retrieval keyword corresponding to the problem to be retrieved is determined, an entity link rule is used to take the retrieval keyword as an entity and link it to the feature information entities and/or feature tag entities in the feature information database, generating a plurality of paths among the linked entities. The similarity of each path is calculated and the paths are sorted by similarity. With a path similarity threshold set in advance, the entities corresponding to the paths whose similarity is higher than the threshold are determined as target entities, and the target entities obtained are used as the feedback to the retrieval request.

Following the above example, the entity link rule is used to search the movie resource library based on the retrieval keywords "first-level actor", "director a", and "2003". The retrieval keyword "first-level actor" is taken as an entity and linked to the feature information entities and/or feature tag entities in the movie resource library, that is, to all actor entities whose feature tag is "first-level actor". The retrieval keyword "director a" is likewise taken as an entity and linked to the corresponding feature information entities and/or feature tag entities in the movie resource library. The retrieval keyword "2003" is taken as an entity and linked to all movie resource entities whose feature tag is "2000-2010". Finally, all the entities determined are taken as the answer to the question.

In summary, the entity link rule is used to link the retrieval keywords, as entities, to the movie resource library and to determine the entities in the library that correspond to them, thereby determining the answer to the problem to be retrieved. Answers are obtained quickly, which shortens the user's waiting time and improves the user experience, and retrieving through the entity link rule improves retrieval accuracy.

When retrieval is performed in the feature information database based on the retrieval keywords, a large amount of query information may be retrieved, and the pieces of query information conform to the retrieval keywords to different degrees. To feed the query information back to the user in a more orderly way that is easier to read, the query information needs to be sorted according to a sorting rule, which is specifically implemented as follows:

Step S208-8, querying the feature information database according to the retrieval keywords to obtain at least two pieces of query information;

Step S208-10, sorting the at least two pieces of query information based on a preset sorting rule to obtain a query result list;

Step S208-12, using the query result list as the response to the retrieval request.

Specifically, in this embodiment, the query information refers to feature information in a feature information database corresponding to a search keyword, which is obtained by querying according to the search keyword, and the retrieved feature information is query information; the sorting rule refers to a preset sorting method for the retrieved query information, and the query information is sorted according to a certain sequence; the query result list is a query information list obtained by arranging the queried query information according to a preset sorting rule.

Based on this, when retrieval is performed in the feature information database based on the retrieval keywords and a plurality of candidate entities corresponding to the retrieval keywords are determined, the determined candidate entities are the query information obtained. Because a large amount of query information is retrieved and the pieces of query information conform to the retrieval keywords to different degrees, presenting them unsorted would be inconvenient for the user to read, so the query information needs to be sorted according to a sorting rule. For example, the query information can be arranged in descending order of its degree of conformity with the retrieval keywords, or according to the order in which the keywords corresponding to the retrieval keywords appear in the problem to be retrieved. The sorted query information is fed back to the user as the query result in the form of a query result list. It should be noted that the sorting rule for the query information may be set in advance, and includes, but is not limited to, the two methods described above.

Following the above example, after the movie resources conforming to the retrieval keywords are obtained by searching the movie resource library, because many movie resources are obtained, the movie resource entities corresponding to "first-level actor", "director a", and "2000-2010" need to be arranged according to a sorting rule before being fed back to the user who raised the problem to be retrieved. Specifically, they can be arranged according to the degree of conformity between each retrieved movie resource and "first-level actor", "director a", and "2000-2010". Suppose the retrieval results are: movie resource A, starring an ordinary actor, directed by director a, and released in 2003; movie resource B, starring a first-level actor, directed by a different director, and released in 2011; movie resource C, starring a first-level actor, directed by director c, and released in 2012; and movie resource D, starring a first-level actor, directed by director a, and released in 2001. When movie resources A to D are sorted, movie resource D, which satisfies all three retrieval keywords, is placed first; movie resource A, which satisfies two of the three retrieval keywords, is second; movie resource B and movie resource C, which each satisfy one retrieval keyword, are arranged by the initial letter of their names, that is, movie resource B is placed before movie resource C. The resulting movie resource list is movie resource D, movie resource A, movie resource B, movie resource C, and the sorted list is fed back to the user as the response to the retrieval request.
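The ordering in this example can be reproduced with a short sketch that counts how many retrieval keywords each movie resource satisfies and breaks ties by name. The dictionary layout and the function name `rank_results` are assumptions; the feature sets simply encode the four movie resources described above.

```python
# Features of each movie resource that are relevant to the three retrieval
# keywords, encoded from the example (hypothetical data layout).
MOVIES = {
    "movie resource A": {"director a", "2000-2010"},
    "movie resource B": {"first-level actor"},
    "movie resource C": {"first-level actor"},
    "movie resource D": {"first-level actor", "director a", "2000-2010"},
}

RETRIEVAL_KEYWORDS = ["first-level actor", "director a", "2000-2010"]


def rank_results(movies: dict, retrieval_keywords: list) -> list:
    """Sort by number of satisfied keywords (descending), then by name."""
    wanted = set(retrieval_keywords)
    scored = [(len(wanted & features), name) for name, features in movies.items()]
    # Keep only resources that satisfy at least one retrieval keyword.
    matched = [(score, name) for score, name in scored if score > 0]
    matched.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in matched]


result_list = rank_results(MOVIES, RETRIEVAL_KEYWORDS)
```

Sorting on the pair (negative match count, name) expresses both rules in one key, so resources with equal conformity fall back to alphabetical order without a second pass.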

It should be noted that, when there are two or more search keywords, the entities corresponding to each search keyword may be determined first, and the entities satisfying the conditions of two or more search keywords may then be determined from among them. Alternatively, the entities corresponding to one search keyword may be determined first, the entities satisfying the condition of the second search keyword may then be searched for among the determined entities, and so on until the entities satisfying the conditions of all search keywords are determined. The entities so determined are the answers to the problem to be retrieved, namely the response to the retrieval request.
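The two strategies described above can be sketched as follows, assuming for illustration that the feature information database is available as an in-memory mapping from each search keyword to the set of entities that satisfy it. The `index` structure and both function names are hypothetical.

```python
def intersect_all(index, keywords):
    """Strategy 1: look up every keyword first, then intersect the entity sets."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def filter_progressively(index, keywords):
    """Strategy 2: start from the first keyword's entities and narrow step by step."""
    if not keywords:
        return set()
    remaining = set(index.get(keywords[0], set()))
    for k in keywords[1:]:
        allowed = index.get(k, set())
        remaining = {e for e in remaining if e in allowed}
        if not remaining:  # nothing left to narrow
            break
    return remaining


# Illustrative index built from the movie example (entity names only).
index = {
    "first-class actor": {"B", "C", "D"},
    "director a": {"A", "D"},
    "2000-2010": {"A", "D"},
}
keywords = list(index)
result_1 = intersect_all(index, keywords)
result_2 = filter_progressively(index, keywords)
print(result_1, result_2)
```

Both strategies return the same entities; the progressive variant can stop early once no entity remains.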

In summary, the retrieved answers that meet the conditions of the question are ranked by degree of conformity to generate an answer list, with the answers of higher conformity placed at the top of the list, so that the answers that best match the question are displayed to the user first and the user experience is improved.

Because the data in the feature information database does not remain unchanged, developers can continuously maintain and update the feature information database over time to meet the needs of users. A specific implementation is as follows:

Step S208-14, receiving new data uploaded by the user;

Step S208-16, processing the new data and storing it in the feature information database.

Specifically, in this embodiment, the new data refers to data to be processed that newly arises, due to time factors or social factors, after the feature information database has been created based on the original data to be processed. This newly arising data needs to be processed and then stored in the feature information database.

Based on this, when regularly maintaining the feature information database, a developer may directly upload the new data. The relevant operations of step S202 are then executed with the uploaded new data as the data to be processed, and the processed data is stored in the feature information database. The specific execution of step S202 has been described in detail above and is not repeated here.

Following the above example, because movie and television works are continuously being created, new works may appear at any time. To keep the movie and television data stored in the movie resource library as up to date as possible, after a new work is released, the maintenance personnel of the movie resource library upload the data related to the new work, and this data is stored in the movie resource library after the processing of step S202, thereby updating the movie resource library. Because the new data is processed with the same method that was used for the data to be processed when the feature information database was constructed, updating and maintaining the feature information database is simplified and the update efficiency is improved.

In summary, when movie and television resources are processed, a movie resource library is created from the processed resources, and problem retrieval is performed in that library, the movie resource library is constructed by analyzing the movie and television resources and allocating feature tags to them based on the resource information database. When a user submits a problem to be retrieved, keywords are extracted from the problem input by the user and then standardized, the movie resource library is queried based on the standardized keywords, and the movie resource information meeting the conditions is sorted to generate a query result list that is returned to the user. In this way, a high-quality movie resource library is constructed rapidly and the labor cost of constructing it is reduced. At the same time, the movie resource library supports question answering over complex relationships, and extracting and standardizing the keywords of the question improves the accuracy and flexibility of question retrieval and the richness of the answers.

The following description will further explain the text data processing method by taking an application of the text data processing method provided by the present application in resume data processing as an example with reference to fig. 3. Fig. 3 shows a processing flow chart of a text data processing method applied to resume data processing according to an embodiment of the present application, which specifically includes the following steps:

Step S302, obtaining the resume to be processed and analyzing the resume to be processed to obtain the feature information.

In this embodiment, the text data processing method is described by taking resume data as an example.

Specifically, after the resume to be processed is obtained, a plurality of pieces of text information are extracted according to the layout order of the resume, so as to filter out information irrelevant to the resume content, such as background pictures, page numbers, and serial numbers. The resume to be processed may be a document of various format types, including the doc type, the pdf type, picture types, and the like.

The extracted text information is classified according to a preset classification rule, which may follow the sections of the resume document, for example: personal information (name, gender, age, contact details, etc.); educational background (school, dates of enrollment and graduation, major, degree, main courses, etc.); work experience (enterprise name, position, main work content, period of employment, etc.); skills and specialties (language skills, computer skills, other skills); and self-evaluation (character, hobbies, strengths, etc.). Classifying the extracted text information according to this rule yields five types of text information, namely personal information, educational background, work experience, skills and specialties, and self-evaluation. The five types of text information are then analyzed, and together they form the feature information corresponding to the resume to be processed.
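As a minimal sketch of such a classification rule, the example below groups extracted text lines under the most recently seen section heading. It assumes, purely for illustration, that each section of the resume begins with a recognizable heading line; the heading strings, the `SECTION_HEADINGS` mapping, and `classify_lines` are hypothetical, and a real resume may require a trained text classifier instead.

```python
# Hypothetical heading-to-category mapping for the five sections named above.
SECTION_HEADINGS = {
    "personal information": "personal_information",
    "educational background": "educational_background",
    "work experience": "work_experience",
    "skills and specialties": "skills",
    "self-evaluation": "self_evaluation",
}


def classify_lines(lines):
    """Group resume text lines under the most recent recognized section heading."""
    sections = {name: [] for name in SECTION_HEADINGS.values()}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key = SECTION_HEADINGS.get(line.lower().rstrip(":"))
        if key is not None:
            current = key                   # a heading switches the category
        elif current is not None:
            sections[current].append(line)  # body text joins the open category
    return sections


resume_lines = [
    "Personal information:",
    "Zhang San, male, 25 years old",
    "Educational background:",
    "S University, computer major, bachelor's degree, 2015-2019",
    "Work experience:",
    "Company A, back-end development, 2019-07 to 2021-03",
]
sections = classify_lines(resume_lines)
print(sections["work_experience"])
```

Text that appears before any recognized heading is dropped in this sketch, which loosely mirrors the filtering of irrelevant layout content.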

Step S304, allocating at least one feature tag to the resume to be processed based on the feature information and the resource information database.

Specifically, the resource information database includes a school database, which stores information such as the school names, school rankings, major names, and major rankings of different schools, and an enterprise database, which stores information such as the enterprise names, registration dates, scale, and rankings of different enterprises.

At least one feature tag is allocated to the resume to be processed based on the feature information obtained by analyzing the resume and the information in the resource information database. For example, analyzing the resume to be processed yields: "Zhang San, male, 25 years old, 2 years of work experience, S University, computer major, bachelor's degree, enrolled in 2015, graduated in 2019, Company A, back-end development, joined in July 2019, left in March 2021". Analysis in combination with the resource information database shows that the tag of "S University" is "key university" and the tag of "Company A" is "world top-500 enterprise". In addition, tags are allocated to Zhang San's personal information according to preset rules, for example: the tag of "25 years old" is "youth", the tag of "2 years of work experience" is "work experience 1-3 years", the tag of "bachelor's degree" is "bachelor's degree", and so on.
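The tag allocation in this example can be sketched as follows. The dictionaries standing in for the school and enterprise databases, the field names, and the age threshold for "youth" are illustrative assumptions; the application states only that such tags follow from the resource information database and from preset rules.

```python
# Hypothetical stand-ins for the school and enterprise databases.
SCHOOL_DB = {"S University": "key university"}
ENTERPRISE_DB = {"Company A": "world top-500 enterprise"}

# Experience buckets named in the description; other ranges are not specified.
EXPERIENCE_BUCKETS = [
    (1, 3, "work experience 1-3 years"),
    (5, 7, "work experience 5-7 years"),
]
YOUTH_MAX_AGE = 35  # assumed threshold, not stated in the application


def assign_tags(features):
    """Return the feature tags for one parsed resume (a dict of features)."""
    tags = []
    age = features.get("age")
    if age is not None and age <= YOUTH_MAX_AGE:
        tags.append("youth")
    years = features.get("experience_years")
    if years is not None:
        for low, high, label in EXPERIENCE_BUCKETS:
            if low <= years <= high:
                tags.append(label)
                break
    if features.get("degree"):
        tags.append(features["degree"])  # the degree serves as its own tag
    for field, db in (("school", SCHOOL_DB), ("employer", ENTERPRISE_DB)):
        tag = db.get(features.get(field))
        if tag:
            tags.append(tag)
    return tags


zhang_san = {
    "name": "Zhang San",
    "age": 25,
    "experience_years": 2,
    "school": "S University",
    "degree": "bachelor's degree",
    "employer": "Company A",
}
zhang_san_tags = assign_tags(zhang_san)
print(zhang_san_tags)
```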

Step S306, constructing a talent base based on the feature information and the feature tags.

After all the feature information in Zhang San's resume and the at least one feature tag corresponding to the feature information are determined, the feature information and feature tags of Zhang San's resume are stored in the talent base as one record. The operations of steps S302 to S304 are performed on the other resumes to be processed, so that each resume is processed into a record and stored in the talent base, completing the construction of the talent base.

Step S308, extracting information from the problem to be retrieved to obtain the keywords corresponding to the problem to be retrieved.

Information is extracted from the problem to be retrieved by a named entity recognition method, so as to extract the keywords in the problem. Taking the problem "computer-major bachelor's graduates of S University with three years of experience" as an example, in which the school is written in the colloquial abbreviated form "S big", applying named entity recognition to the problem yields the keywords "S big", "three years of experience", "computer", and "bachelor's degree".

Step S310, standardizing the keywords based on the entity link rule to obtain standardized keywords.

The extracted keywords are standardized by means of an entity linking rule. During entity linking, each keyword is treated as an entity and linked to a plurality of candidate entities in the talent base, the feature information and feature tags in the talent base being the candidate entities. According to the similarity between the keyword entity and each candidate entity, and based on a preset similarity threshold, the candidate entities whose similarity to the keyword is higher than the threshold are screened out and used as the standardized keywords corresponding to that keyword. For example, when information extraction on the problem to be retrieved yields the keywords "S big", "three years of experience", "computer", and "bachelor's degree", the four keywords are each linked into the talent base. The candidate entities to which the keyword "S big" (a colloquial abbreviation of "S University") is linked may be all candidate entities containing "S" and "University", such as "S University", "S1 University", "S2 University", and "S3 University". Among these, the candidate entity whose similarity to the keyword "S big" falls within the preset threshold range, namely "S University", is determined to be the standardized keyword corresponding to "S big". The keywords "three years of experience", "computer", and "bachelor's degree" are standardized by the same operation to obtain their corresponding standardized keywords.
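The threshold-based screening described above can be sketched as follows. The application does not specify the similarity measure, so this sketch substitutes the character-level ratio from Python's standard `difflib` module; the abbreviated keyword "S Univ", the threshold of 0.6, and the choice to keep only the single best candidate are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for the unspecified measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_keyword(keyword, candidates, threshold=0.6):
    """Return the most similar candidate at or above the threshold, else None."""
    scored = [(similarity(keyword, c), c) for c in candidates]
    eligible = [(s, c) for s, c in scored if s >= threshold]
    if not eligible:
        return None
    return max(eligible)[1]


# Candidate entities as they might appear in the talent base.
candidates = ["S University", "S1 University", "S2 University", "S3 University"]
standardized = normalize_keyword("S Univ", candidates)
print(standardized)
```

A keyword with no sufficiently similar candidate yields `None`, which a caller could treat as a keyword that cannot be standardized against the talent base.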

Step S312, inquiring the talent base according to the standardized keywords to obtain candidate information.

The candidates in the talent base are queried according to the obtained standardized keywords by means of the entity linking rule. Because the standardized keywords are feature information in the talent base, linking the standardized keywords to the talent base forms a plurality of link paths, and the talent entities in the talent base corresponding to all the link paths are taken as candidates, thereby obtaining the candidate information.

In addition, a keyword extracted from the problem to be retrieved may itself take the form of a keyword feature tag, that is, a feature tag of the kind that corresponds to standardized keywords. For example, in the talent base, the feature tag of "S University" is "key university", and the feature tag of "Company A" is "world top-500 enterprise". When the problem to be retrieved is "bachelor's graduates of key universities with three years of work experience", the extracted keywords are "three years of work experience", "key university", and "bachelor's degree". Obviously, "key university" does not refer to a specific school name but is a general designation for a class of universities; that is, when "key university" is extracted as a keyword, it is a keyword feature tag. The keyword feature tag is queried in the talent base together with the other standardized keywords to obtain a plurality of pieces of candidate information meeting the conditions.

The obtained eligible candidate information may be as follows. Candidate A: male, 24 years old, one year of work experience, S University, bachelor's degree; the tags corresponding to candidate A are "youth, work experience 1-3 years, key university, bachelor's degree". Candidate B: male, 28 years old, three years of work experience, S University, postgraduate degree; the tags corresponding to candidate B are "youth, work experience 1-3 years, key university, postgraduate degree". Candidate C: female, 26 years old, three years of work experience, University D, bachelor's degree; the tags corresponding to candidate C are "youth, work experience 1-3 years, non-key university, bachelor's degree". Candidate D: male, 30 years old, six years of work experience, University F, bachelor's degree; the tags corresponding to candidate D are "youth, work experience 5-7 years, non-key university, bachelor's degree".

Step S314, sorting the resumes corresponding to the candidate information based on a preset sorting rule to obtain a resume ranked list corresponding to the problem to be retrieved.

Because the candidates are queried according to each standardized keyword and each keyword feature tag, a candidate may conform to one or more of the standardized keywords and keyword feature tags. The candidates therefore need to be sorted according to a sorting rule, namely the degree of conformity, so that they are arranged in descending order of conformity to form a candidate list, which serves as the resume ranked list corresponding to the problem to be retrieved. The degree of conformity is determined by the number of paths through which the standardized keywords and keyword feature tags are linked to the feature information in each piece of resume information in the talent base; the more linked paths there are, the higher the degree of conformity.

Following the above example, the problem to be retrieved is "bachelor's graduates of key universities with three years of work experience", the extracted keywords are "three years of work experience", "key university", and "bachelor's degree", and the standardized keywords are "work experience 1-3 years", "key university", and "bachelor's degree". The eligible candidate information obtained by querying the talent base is candidate A, candidate B, candidate C, and candidate D, and this candidate information is then sorted. The standardized keywords satisfied by candidate A are "work experience 1-3 years", "key university", and "bachelor's degree"; those satisfied by candidate B are "work experience 1-3 years" and "key university"; those satisfied by candidate C are "work experience 1-3 years" and "bachelor's degree"; and the only one satisfied by candidate D is "bachelor's degree". Thus candidate A satisfies all three standardized keywords, candidates B and C each satisfy two of the three, and candidate D satisfies only one, so the resulting resume ranked list is: candidate A, candidate B and candidate C (tied), candidate D.
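The ranking in this example, including the tie between candidate B and candidate C, can be sketched as follows. Modeling each candidate's tags as a set and taking the degree of conformity to be the size of the overlap with the standardized keywords is an illustrative simplification of the path-counting rule; all names in the code are hypothetical.

```python
from itertools import groupby

QUERY = {"work experience 1-3 years", "key university", "bachelor's degree"}

# Tags of the four candidates as listed in the description.
CANDIDATES = {
    "A": {"youth", "work experience 1-3 years", "key university", "bachelor's degree"},
    "B": {"youth", "work experience 1-3 years", "key university", "postgraduate degree"},
    "C": {"youth", "work experience 1-3 years", "non-key university", "bachelor's degree"},
    "D": {"youth", "work experience 5-7 years", "non-key university", "bachelor's degree"},
}


def ranked_groups(candidates, query):
    """Return candidate names grouped by descending number of matched keywords."""
    scored = sorted(
        ((len(query & tags), name) for name, tags in candidates.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    # groupby needs its input sorted by the grouping key, which holds here.
    return [
        [name for _, name in group]
        for _, group in groupby(scored, key=lambda pair: pair[0])
    ]


groups = ranked_groups(CANDIDATES, QUERY)
print(groups)
```

Grouping by score keeps tied candidates together, so a caller can decide separately how to order candidates within a tie.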

In summary, when resume data is processed and candidate retrieval is performed on a database composed of resume data, the talent base is constructed by analyzing the resume data and allocating feature tags to it based on the resource information database. When a user performs candidate retrieval, keywords are extracted from the problem text input by the user and then standardized, the talent base is queried based on the standardized keywords to obtain the candidates meeting the conditions, and the candidates are sorted to generate a resume list that is returned to the user. In this way, a high-quality talent base is constructed rapidly and question answering over complex relationships is supported, while extracting and standardizing the keywords of the question improves the accuracy and flexibility of question retrieval and the richness of the answers.

Corresponding to the above method embodiment, the present application further provides a text data processing apparatus embodiment, and fig. 4 shows a schematic structural diagram of a text data processing apparatus provided in an embodiment of the present application. As shown in fig. 4, the apparatus includes:

an obtaining module 402, configured to obtain to-be-processed data corresponding to a target field and a feature tag corresponding to the to-be-processed data;

a building module 404 configured to build a feature information database associated with the target field based on the feature information in the data to be processed and the feature tag;

the processing module 406 is configured to receive a retrieval request carrying a problem to be retrieved, and construct a retrieval keyword corresponding to the feature information database based on the problem to be retrieved;

and the query module 408 is configured to query the feature information database according to the search keyword, and take the query result as the response of the search request.

In an optional embodiment, the obtaining module 402 is further configured to:

analyzing the data to be processed to obtain the characteristic information; matching the characteristic information with a reference characteristic label contained in a resource information database, and distributing at least one characteristic label for the data to be processed according to a matching result; wherein the feature information database associated with the target field is constructed based on the feature information and at least one feature tag corresponding to the feature information.

In an optional embodiment, the obtaining module 402 is further configured to:

extracting at least one text message contained in the data to be processed; performing text classification on the at least one text message to obtain a text category of each text message; and analyzing the at least one text message according to the text type to obtain the characteristic information corresponding to the data to be processed.

In an optional embodiment, the processing module 406 is further configured to:

extracting information of the problem to be retrieved to obtain a keyword corresponding to the problem to be retrieved; standardizing the keywords based on entity link rules to obtain retrieval keywords corresponding to the characteristic information database.

In an optional embodiment, the processing module 406 is further configured to:

linking a plurality of candidate entities corresponding to the keywords in a feature information database; screening a target candidate entity from the plurality of candidate entities based on a screening rule; and determining the retrieval key words according to the target candidate entities.

In an optional embodiment, the query module 408 is further configured to:

generating at least one candidate path corresponding to the search keyword based on the search keyword and the feature information database; and determining a target entity according to the at least one candidate path, and using the target entity as a response of the retrieval request.

In an optional embodiment, the query module 408 is further configured to:

inquiring the characteristic information database according to the retrieval key words to obtain at least two pieces of inquiry information; sequencing the at least two pieces of query information based on a preset sequencing rule to obtain a query result list; and using the inquiry result list as the response of the retrieval request.

In an optional embodiment, the building module 404 is further configured to:

detecting the resource information database based on preset duration; determining change information of the resource information database under the condition that the change of the tag information and/or the data information contained in the resource information database is detected; updating the feature information database based on the change information.
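As a rough sketch of this update flow, the example below compares two snapshots of the resource information database, derives the change information, and applies it to the records of the feature information database. The periodic scheduling itself is omitted, and the snapshot format, the record layout, and both function names are assumptions made for illustration; entries removed from the resource information database are not handled.

```python
def diff_tags(old, new):
    """Return entries whose tag was added or changed between two snapshots."""
    return {key: tag for key, tag in new.items() if old.get(key) != tag}


def apply_changes(feature_db, field, changes):
    """Rewrite the tag of every record whose `field` value has a changed tag."""
    updated = 0
    for record in feature_db:
        value = record.get(field)
        if value in changes:
            record["tags"][field] = changes[value]
            updated += 1
    return updated


# Hypothetical snapshots of the school database taken one preset duration apart.
old_snapshot = {"S University": "key university", "D University": "non-key university"}
new_snapshot = {"S University": "key university", "D University": "key university"}

feature_db = [
    {"name": "candidate A", "school": "S University", "tags": {"school": "key university"}},
    {"name": "candidate C", "school": "D University", "tags": {"school": "non-key university"}},
]

changes = diff_tags(old_snapshot, new_snapshot)
updated_count = apply_changes(feature_db, "school", changes)
print(changes, updated_count)
```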

In an optional embodiment, the text data processing apparatus is further configured to:

receiving new data uploaded by a user; and processing the newly added data and storing the newly added data into the characteristic information database.

The above is a schematic configuration of a text data processing apparatus of the present embodiment. It should be noted that the technical solution of the text data processing apparatus and the technical solution of the text data processing method belong to the same concept, and details that are not described in detail in the technical solution of the text data processing apparatus can be referred to the description of the technical solution of the text data processing method. Further, the components in the device embodiment should be understood as functional blocks that must be created to implement the steps of the program flow or the steps of the method, and each functional block is not actually divided or separately defined. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

Fig. 5 illustrates a block diagram of a computing device 500 provided according to an embodiment of the present application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.

Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the application, the above-described components of computing device 500 and other components not shown in FIG. 5 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.

Wherein processor 520 is configured to execute the computer-executable instructions of the text data processing method.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text data processing method.

An embodiment of the present application also provides a computer readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the text data processing method.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text data processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text data processing method.

An embodiment of the present application further provides a chip, in which a computer program is stored, and the computer program implements the steps of the text data processing method when executed by the chip.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately expanded or restricted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

1. A text data processing method, comprising:

2. The method according to claim 1, wherein the obtaining of the feature tag corresponding to the data to be processed comprises:

analyzing the data to be processed to obtain the characteristic information;

3. The method according to claim 2, wherein the analyzing the data to be processed to obtain the feature information comprises:

extracting at least one text message contained in the data to be processed;

4. The method according to claim 1, wherein the constructing of the search keyword corresponding to the feature information database based on the question to be searched comprises:

5. The method of claim 4, wherein the normalizing the keyword based on the entity linking rule to obtain a search keyword corresponding to the feature information database comprises:

6. The method according to claim 1, wherein said querying the feature information database according to the search keyword and using the query result as the response of the search request comprises:

7. The method of claim 1, wherein the querying the feature information database according to the search keyword comprises:

8. The method according to claim 1, wherein said querying the feature information database according to the search keyword and using the query result as the response of the search request comprises:

and using the inquiry result list as the response of the retrieval request.

9. The method according to claim 1, wherein after the step of constructing the feature information database associated with the target domain based on the feature information in the data to be processed and the feature tag is performed, the method further comprises:

detecting a resource information database based on preset duration;

updating the feature information database based on the change information.

10. The method of claim 1, further comprising:

receiving new data uploaded by a user;

11. A text data processing apparatus, characterized by comprising:

12. A computing device, comprising:

a memory and a processor;

the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions to realize the steps of the text data processing method in any one of claims 1 to 10.

13. A computer-readable storage medium storing computer instructions, which when executed by a processor implement the steps of the text data processing method according to any one of claims 1 to 10.