CN117390146A - Question-answering system oriented to the judicial examination field, data processing method and terminal

Info

Publication number
CN117390146A
CN117390146A
Authority
CN
China
Prior art keywords
vector
recall
document
questions
content
Prior art date
Legal status
Pending
Application number
CN202311128057.5A
Other languages
Chinese (zh)
Inventor
陈旭阳
杨旭川
刘琛
Current Assignee
Chongqing Juexiao Technology Co ltd
Original Assignee
Chongqing Juexiao Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Juexiao Technology Co ltd filed Critical Chongqing Juexiao Technology Co ltd
Priority to CN202311128057.5A priority Critical patent/CN117390146A/en
Publication of CN117390146A publication Critical patent/CN117390146A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents


Abstract

The invention discloses a data processing method for a question-answering system oriented to the judicial examination field, which comprises the following steps: organizing the data corresponding to teaching materials, subjective questions, objective questions and legal provisions into text documents, and processing the text documents to form corresponding data sets; vectorizing the text documents with a text vector model to obtain a vector set, and storing the vectors and corresponding information in the corresponding knowledge bases; vectorizing a question with the text vector model to obtain a question vector, computing the similarity between the question vector and each vector in the vector set with cosine similarity, computing the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplying the similarity and the relevance to obtain the comprehensive relevance between the question and a document; and integrating the recalled content and the prompt words to form the input of a large language model, whose output is then taken as the answer to the question. The method can accurately find the answer to a question, with high answer accuracy and high flexibility.

Description

Question-answering system oriented to judicial examination field, data processing method and terminal
Technical Field
The invention relates to the technical field of online education, and in particular to a question-answering system oriented to the judicial examination field, together with a corresponding data processing method, terminal and medium.
Background
In the field of online education for the judicial examination, many students run into questions while studying, and professional teachers cannot answer each of them one-to-one in time. It is therefore necessary to develop a question-answering system for the judicial examination that can answer students' questions in real time.
Question-answering systems in the prior art are inflexible, have low accuracy, cannot answer precisely, and depend on the richness of question-answer libraries, among other problems:
for example: the patent application proposes a Chinese law question-answering system based on an intention recognition and twinning network, which is disclosed by a patent publication number CN113326364A, wherein a bert pre-training language model is utilized to vectorize legal questions, then cosine similarity calculation is respectively carried out on the questions proposed by a user and corresponding vectors in a question bank, and an answer corresponding to the most similar questions is selected to be returned as an answer of the user input questions. The method depends on the richness of the collected question library, if the collected questions are not rich enough, the situation that the questions of the user cannot be matched in the question library exists, so that the phenomenon of answering the questions is caused, and on the other hand, the questions always answer the same answer exist for different questions.
For example, the patent application published under publication number CN111400453A provides an intelligent interaction system and method for legal consultation, in which the legal consultation information input by the user first undergoes grammatical conversion, intention recognition is then performed on the converted information, and finally a corresponding database is selected according to the intention type, related questions are matched and retrieved, and the answers are summarized. This method has low accuracy, cannot accurately locate the corresponding question, can only output a summary of the answers to several questions, and offers little flexibility in its answers.
Disclosure of Invention
Aiming at the defects in the prior art, the question-answering system oriented to the judicial examination field, the data processing method, the terminal and the medium provided by the invention can accurately answer questions posed by students in the judicial examination field, with highly flexible answers.
In a first aspect, a data processing method of a question-answering system oriented to the judicial examination field provided by an embodiment of the invention includes:
organizing the data corresponding to judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processing the text documents to form corresponding data sets;
vectorizing the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and storing the vectors and corresponding information in the corresponding knowledge bases;
obtaining a question input by a user, vectorizing the question with the text vector model to obtain a question vector, computing the similarity between the question vector and each vector in the vector set with cosine similarity, compiling the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary and adding it to the custom dictionary of the jieba word segmenter, computing the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplying the similarity and the relevance to obtain the comprehensive relevance between the question and a document;
comparing the comprehensive relevance with a set threshold, recalling a document if its comprehensive relevance is greater than the threshold, forming the recalled documents into a recall document set, and obtaining a recall content set by applying different recall strategies to the different kinds of content in the recalled documents;
integrating the recall content set and the prompt words to form the input content of a large language model, feeding the input content to the large language model, and having the large language model generate the corresponding answer from the provided input content and the question.
In a second aspect, the question-answering system oriented to the judicial examination field provided by an embodiment of the invention comprises a data set module, a knowledge base construction module, a search engine module, a content recall module and a large language model module;
the data set module organizes the data corresponding to the judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processes the text documents to form corresponding data sets;
the knowledge base construction module vectorizes the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and stores the vectors and corresponding information in the corresponding knowledge bases;
the search engine module obtains the question input by a user, vectorizes the question with the text vector model to obtain a question vector, computes the similarity between the question vector and each vector in the vector set with cosine similarity, compiles the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary added to the custom dictionary of the jieba word segmenter, computes the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplies the similarity and the relevance to obtain the comprehensive relevance;
the content recall module compares the comprehensive relevance with a set threshold, recalls a document if its comprehensive relevance is greater than the threshold, forms the recalled documents into a recall document set, and obtains a recall content set by applying different recall strategies to the different kinds of content in the recalled documents;
the large language model module integrates the recall content set and the prompt words to form the input content of the large language model, feeds the input content to the large language model, and has the large language model generate the corresponding answer from the provided input content and the question.
In a third aspect, an embodiment of the present invention provides an intelligent terminal, including a processor, an input device, an output device, and a memory, where the processor is connected to the input device, the output device, and the memory, respectively, and the memory is configured to store a computer program, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the foregoing embodiment.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method described in the above embodiments.
The invention has the beneficial effects that:
according to the data processing method of the question-answering system facing the judicial examination field, the large-scale language model is combined with the vector database, meanwhile, the improved BM25 algorithm and the vector search engine are combined to conduct accurate search according to different problems of users, different search strategies are adopted to find out content fragments most relevant to questions of students in an original text, specific prompt words are adopted to process the content fragments, the large-scale language model can filter irrelevant contents, and flexible answering can be conducted according to different question-asking languages and modes, so that answers of the problems can be accurately found, and the answer accuracy and the flexibility are high.
The question-answering system oriented to the judicial examination field likewise combines a large language model with a vector database, and combines the improved BM25 algorithm with the vector search engine so that accurate retrieval can be performed for different user questions with different retrieval strategies, finding the content fragments in the original texts most relevant to the student's question, applying specific prompt words and handing the result to the large language model for processing; the large language model can, on the one hand, filter out irrelevant content and, on the other hand, answer flexibly for different questioning languages and styles.
The embodiments of the invention further provide an intelligent terminal and a computer-readable storage medium which, sharing the same inventive concept, have the same beneficial effects as the data processing method of the question-answering system oriented to the judicial examination field.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
Fig. 1 is a flowchart showing a data processing method of a question-answering system oriented to the field of judicial examination according to a first embodiment of the present invention;
fig. 2 is a block diagram of a question-answering system for judicial examination according to another embodiment of the present invention;
fig. 3 is a block diagram of an intelligent terminal according to another embodiment of the present invention.
Detailed Description
The embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings; evidently, the embodiments described are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
Referring to fig. 1, a data processing method of a question-answering system oriented to the judicial examination field provided by a first embodiment of the present invention includes the following steps:
organizing the data corresponding to judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processing the text documents to form corresponding data sets;
vectorizing the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and storing the vectors and corresponding information in the corresponding knowledge bases;
vectorizing a question input by a user with the text vector model to obtain a question vector, computing the similarity between the question vector and each vector in the vector set with cosine similarity, compiling the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary and adding it to the custom dictionary of the jieba word segmenter, computing the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplying the similarity and the relevance to obtain the comprehensive relevance between the question and a document;
comparing the comprehensive relevance with a set threshold, recalling a document if its comprehensive relevance is greater than the threshold, forming the recalled documents into a recall document set, and obtaining a recall content set by applying different recall strategies to the different kinds of content in the recalled documents;
integrating the recall content set and the prompt words to form the input content of a large language model, feeding the input content to the large language model, and having the large language model generate the corresponding answer from the provided input content and the question.
Specifically, the data processing method of the question-answering system oriented to the judicial examination field provided by this embodiment of the invention comprises the following steps:
data set arrangement: the judicial examination field relates to subjects such as criminal law, civil law, criminal litigation law, marketing law, economic law, constitution law and law, the textbooks corresponding to the subjects are arranged into text documents, the text documents are divided according to line-changing symbols, and each paragraph represents a piece of text after empty lines are filtered, so that a textbook document data set J with M paragraphs is formed, and M textbook paragraphs are arranged in the J. Objective questions for judicial examination consist of 4 parts: the stem, the answer, the analysis and the question number form a data set O. The subjective questions consist of four parts of stems, questions, answers and question numbers, and the stems and the questions of the subjective questions are spliced to form a data set S. The French strip is composed of 2 parts of names and contents, and a data set I is formed after the 2 parts are spliced.
Creating the knowledge bases: this step uses the text-embedding-ada-002 text vector model. Each paragraph in the teaching material document data set J is fed into the text vector model to obtain the corresponding paragraph vector representation J_vec, and the paragraph text, the paragraph vector J_vec and the paragraph id are stored in the teaching material knowledge base of the vector database. Similarly, the stem of an objective question is converted by the text vector model into a one-dimensional vector O_vec, and the stem, answer, analysis, question number and stem vector O_vec of the objective question are stored in the objective question knowledge base. To build the subjective question knowledge base, the stem and the questions of a subjective question are concatenated and fed into the text vector model to obtain the joint vector representation S_vec of the subjective question stem and questions; the joint vector S_vec together with the stem, questions, answer and question number is stored in the subjective question knowledge base. Similarly, after the name and content of a legal provision are concatenated, they are fed into the text vector model to obtain the vector representation I_vec of the provision name and content; the vector I_vec, the provision name and the content are stored in the legal provision knowledge base.
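A minimal sketch of this knowledge-base construction step might look as follows, assuming the OpenAI Python client (version 1.x) for the text-embedding-ada-002 model and plain in-memory lists of records standing in for the vector database; both choices are illustrative assumptions rather than the embodiment's actual storage.

from openai import OpenAI

def embed(texts):
    """Return one text-embedding-ada-002 vector per input text."""
    client = OpenAI()   # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

def build_textbook_knowledge_base(paragraphs):
    """Store paragraph text, paragraph vector J_vec and paragraph id, mirroring
    the teaching material knowledge base described above (in-memory stand-in)."""
    vectors = embed(paragraphs)
    return [{"paragraph_id": i, "text": p, "vector": v}
            for i, (p, v) in enumerate(zip(paragraphs, vectors))]

def build_objective_knowledge_base(questions):
    """questions: dicts with 'stem', 'answer', 'analysis' and 'number' keys;
    the stem vector O_vec is stored together with the four fields."""
    vectors = embed([q["stem"] for q in questions])
    return [dict(q, vector=v) for q, v in zip(questions, vectors)]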
Creating a composite search engine: the question input by the user is denoted Q. Q is fed into the text vector model text-embedding-ada-002 to obtain the vector representation Q_vec of the user's question. The similarity between Q_vec and each vector in the vector sets J_vec, O_vec, S_vec and I_vec is computed, using the cosine similarity algorithm to calculate the degree of correlation between the question vector Q_vec and a knowledge base document vector e_vec_i:

Score(Q_vec, e_vec_i) = (Q_vec · e_vec_i) / (|Q_vec| × |e_vec_i|)

where e_vec_i denotes the vector representation of the i-th document fragment e in any of the vector sets J_vec, O_vec, S_vec and I_vec.
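For illustration, the cosine-similarity score defined above can be computed with a few lines of numpy; the example vectors are arbitrary.

import numpy as np

def cosine_score(q_vec, e_vec):
    """Score(Q_vec, e_vec_i) as defined above."""
    q = np.asarray(q_vec, dtype=float)
    e = np.asarray(e_vec, dtype=float)
    return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))

if __name__ == "__main__":
    Q_vec = [0.1, 0.3, 0.5]                                  # arbitrary example vectors
    vector_set = {"J_0": [0.1, 0.2, 0.6], "O_0": [0.9, 0.1, 0.0]}
    for doc_id, e_vec in vector_set.items():
        print(doc_id, round(cosine_score(Q_vec, e_vec), 4))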
The vocabulary related to the legal domain in the teaching materials, objective questions, subjective questions and legal provisions, such as criminal-law terms and other proper nouns, is compiled separately into a legal dictionary, and the legal dictionary is added to the custom dictionary of the jieba word segmenter.
The BM25 algorithm is used to calculate the relevance between the question Q and a paragraph h of the data sets J, O, S and I:

Score(Q, h) = Σ_i IDF(q_i) · f_i · (k_1 + 1) / (f_i + k_1 · (1 - b + b · dl / avgdl))

where h represents one paragraph in the data sets J, O, S and I, q_i represents one word of the question Q after word segmentation, f_i represents the frequency with which the word q_i occurs in the paragraph h, dl represents the number of words in the paragraph h, avgdl represents the average number of words over all documents, k_1 and b are hyperparameters (k_1 is typically 1 and b is 0.75), and IDF(q_i) represents the inverse document frequency of q_i, calculated as follows:

IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5))

where n(q_i) denotes the number of documents in the data sets J, O, S and I that contain the word q_i, and N denotes the total number of documents in the data sets J, O, S and I.
The comprehensive relevance between the question Q and a document can then be obtained from the cosine similarity and the BM25 algorithm above as follows:

Score = Score(Q_vec, e_vec_i) × Score(Q, h)
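The word segmentation, the BM25 score and the final product can be sketched as follows. The jieba custom-dictionary call and the hand-rolled BM25 (with k_1 = 1 and b = 0.75 as above) follow the description; the dictionary file name, the clamping of negative IDF values and the per-call tokenization of the corpus are simplifying assumptions of the sketch.

import math
import os
import jieba

if os.path.exists("legal_dictionary.txt"):       # assumed file: one legal term per line
    jieba.load_userdict("legal_dictionary.txt")

def bm25_score(question, paragraph, corpus, k1=1.0, b=0.75):
    """Score(Q, h) for one paragraph h of the data sets J, O, S, I."""
    corpus_tokens = [jieba.lcut(doc) for doc in corpus]      # recomputed per call in this sketch
    avgdl = sum(len(t) for t in corpus_tokens) / len(corpus_tokens)
    N = len(corpus_tokens)
    h_tokens = jieba.lcut(paragraph)
    dl = len(h_tokens)
    score = 0.0
    for q_i in jieba.lcut(question):
        n_qi = sum(1 for t in corpus_tokens if q_i in t)     # documents containing q_i
        idf = max(math.log((N - n_qi + 0.5) / (n_qi + 0.5)), 0.0)  # clamp negatives, a practical tweak
        f_i = h_tokens.count(q_i)                            # frequency of q_i in h
        score += idf * f_i * (k1 + 1) / (f_i + k1 * (1 - b + b * dl / avgdl))
    return score

def combined_relevance(cosine, bm25):
    """Score = Score(Q_vec, e_vec_i) * Score(Q, h)."""
    return cosine * bm25

In practice the corpus statistics (avgdl and the document frequencies) would be precomputed once when the knowledge bases are built rather than on every query.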
Recall content: a threshold of 0.5 is set; if the computed comprehensive relevance between Q and a document d is greater than 0.5, the relevance between Q and d is considered high and the document d is recalled. The recalled documents d form a recall document set D, arranged in descending order of comprehensive relevance score. If teaching material documents exist in the recall document set D, the id of the teaching material document with the highest comprehensive relevance score among them is taken, and the document with this id is concatenated with its adjacent teaching material documents, for example the teaching material documents corresponding to id+1, id+2, ..., id+m are combined with the document having that id, where m is a number no greater than 5. This expands the context, while the adjacent teaching material content must not be too long, otherwise the input length of the large language model could be exceeded. If objective question documents or subjective question documents exist in the recall document set, the corresponding answers and analyses are obtained from the question numbers corresponding to the stems. The recall content set corresponding to the input question Q is thus obtained and arranged in descending order of comprehensive relevance score as C = [d1, d2, d3, d4, ..., dn]. C comprises teaching material documents, objective question analyses, subjective question analyses and legal provision content.
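A compact sketch of the recall strategies described above, using the 0.5 threshold and the at-most-five adjacent teaching material paragraphs; the record field names and the shape of the scored-document dictionaries are assumptions of the sketch.

THRESHOLD = 0.5
MAX_ADJACENT = 5          # at most the paragraphs id+1 ... id+5 are appended

def recall(scored_docs, textbook_by_id, qa_bank):
    """scored_docs: dicts with 'score', 'type' and type-specific fields;
    textbook_by_id: paragraph id -> paragraph text; qa_bank: question number -> answer/analysis."""
    recalled = sorted((d for d in scored_docs if d["score"] > THRESHOLD),
                      key=lambda d: d["score"], reverse=True)
    contents = []
    textbook_hits = [d for d in recalled if d["type"] == "textbook"]
    if textbook_hits:
        top = textbook_hits[0]                               # highest-scoring teaching material paragraph
        merged = "".join(textbook_by_id[i]
                         for i in range(top["paragraph_id"], top["paragraph_id"] + MAX_ADJACENT + 1)
                         if i in textbook_by_id)
        contents.append({"score": top["score"], "text": merged})
    for d in recalled:
        if d["type"] in ("objective", "subjective"):         # fetch answer and analysis by question number
            qa = qa_bank[d["number"]]
            contents.append({"score": d["score"],
                             "text": d["stem"] + qa["answer"] + qa["analysis"]})
        elif d["type"] == "statute":
            contents.append({"score": d["score"], "text": d["text"]})
    return sorted(contents, key=lambda c: c["score"], reverse=True)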
The top K documents are taken from the recall content set C, where K is an integer chosen so that the total length of the K documents does not exceed the context-processing limit of the large language model, and the top K documents are concatenated to serve as the context. The context is spliced into the following prompt words:
Context information is below.
---------------------
{ context }
---------------------
Using the provided context information, write a comprehensive reply to the given query.
Use prior knowledge only if the given context didn't provide enough information.
Answer the question {question}
After the context and the question Q are spliced into the prompt words, the input content is formed; the input content is fed into the large language model, and the large language model then generates the corresponding answer from the provided input content and question.
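Putting the last two steps together, a minimal sketch of assembling the prompt and querying a large language model could look as follows; the OpenAI chat-completion client and the model name are illustrative assumptions, since the embodiment only requires a large language model.

from openai import OpenAI

PROMPT_TEMPLATE = """Context information is below.
---------------------
{context}
---------------------
Using the provided context information, write a comprehensive reply to the given query.
Use prior knowledge only if the given context didn't provide enough information.
Answer the question {question}"""

def answer(question, recall_contents, k=3, max_context_chars=6000):
    """Take the top-K recalled contents, splice them into the prompt template
    above, and send the result to a chat-style large language model."""
    context = "\n".join(c["text"] for c in recall_contents[:k])[:max_context_chars]
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    client = OpenAI()                                        # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",                               # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content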
According to the data processing method of the question-answering system oriented to the judicial examination field, a large language model is combined with a vector database, and the improved BM25 algorithm is combined with a vector search engine so that accurate retrieval can be performed for different user questions. Different retrieval strategies are used to find the content fragments in the original texts that are most relevant to the student's question, specific prompt words are applied, and the result is handed to the large language model for processing. The large language model can filter out irrelevant content and can answer flexibly for different questioning languages and styles, so the answer to a question can be found accurately, with high answer accuracy and flexibility.
The first embodiment provides a data processing method of a question-answering system oriented to the judicial examination field, and correspondingly the application also provides a question-answering system oriented to the judicial examination field. Referring to fig. 2, a block diagram of the question-answering system oriented to the judicial examination field is provided by a second embodiment of the present invention. Since the system embodiments are substantially similar to the method embodiments, the description is relatively brief; for relevant points, reference is made to the description of the method embodiments. The system embodiments described below are merely illustrative.
The invention provides a question-answering system oriented to the judicial examination field, comprising a data set module, a knowledge base construction module, a search engine module, a content recall module and a large language model module. The data set module organizes the data corresponding to the judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processes the text documents to form corresponding data sets. The knowledge base construction module vectorizes the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and stores the vectors and corresponding information in the corresponding knowledge bases. The search engine module obtains the question input by a user, vectorizes the question with the text vector model to obtain a question vector, computes the similarity between the question vector and each vector in the vector set with cosine similarity, compiles the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary added to the custom dictionary of the jieba word segmenter, computes the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplies the similarity and the relevance to obtain the comprehensive relevance. The content recall module compares the comprehensive relevance with a set threshold, recalls a document if its comprehensive relevance is greater than the threshold, forms the recalled documents into a recall document set, and obtains a recall content set by applying different recall strategies to the different kinds of content in the recalled documents. The large language model module integrates the recall content set and the prompt words to form the input content of the large language model, feeds the input content to the large language model, and has the large language model generate the corresponding answer from the provided input content and the question. The text vector model is text-embedding-ada-002.
The content recall module comprises a content set generation unit, which arranges the recall document set in descending order of comprehensive relevance; if teaching material documents exist in the recall document set, the id of the teaching material document with the highest comprehensive relevance score among the recalled documents is obtained, and the document with that id is combined with its adjacent teaching material documents to obtain recall content corresponding to the input question; if objective question documents or subjective question documents exist in the recall document set, the corresponding answers and analyses are obtained from the question numbers corresponding to the stems, yielding recall content corresponding to the input question; and the resulting recall content set is arranged in descending order of comprehensive relevance score.
The large language model module comprises an integration unit, which takes the first K documents in the recall content set, where K is an integer chosen so that the total length of the K documents does not exceed the context-processing limit of the large language model, concatenates the first K documents to serve as the context, and splices the context and the question into the prompt words to form the input content of the large language model.
The system also comprises a user interaction module providing a question-and-answer interface and knowledge base options: the student can select the corresponding knowledge base in which to ask and answer questions, and when no specific knowledge base is selected, answers are produced based on all knowledge bases by default, as sketched below.
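A small sketch of the knowledge-base option, assuming the four knowledge bases named earlier; the function and variable names are illustrative.

KNOWLEDGE_BASES = {"teaching_material": [], "objective": [], "subjective": [], "statute": []}

def select_search_scope(selected=None):
    """Return the knowledge bases to search for the current question: the one the
    student selected, or all of them when no specific knowledge base is chosen."""
    if selected in KNOWLEDGE_BASES:
        return {selected: KNOWLEDGE_BASES[selected]}
    return dict(KNOWLEDGE_BASES)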
The question-answering system oriented to the judicial examination field likewise combines a large language model with a vector database, and combines the improved BM25 algorithm with the vector search engine so that accurate retrieval can be performed for different user questions with different retrieval strategies, finding the content fragments in the original texts most relevant to the student's question, applying specific prompt words and handing the result to the large language model for processing; the large language model can, on the one hand, filter out irrelevant content and, on the other hand, answer flexibly for different questioning languages and styles.
Referring to fig. 3, which is a schematic structural diagram of an intelligent terminal according to another embodiment of the present invention, the intelligent terminal includes a processor, an input device, an output device and a memory; the processor is connected to the input device, the output device and the memory respectively, the memory is used to store a computer program including program instructions, and the processor is configured to invoke the program instructions to execute the method described in the foregoing embodiments.
It should be appreciated that in embodiments of the present invention, the processor may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The input devices may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output devices may include a display (LCD, etc.), a speaker, etc.
The memory may include read only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In a specific implementation, the processor, the input device, and the output device described in the embodiments of the present invention may execute the implementation described in the method embodiment provided in the embodiments of the present invention, or may execute the implementation of the system embodiment described in the embodiments of the present invention, which is not described herein again.
In a further embodiment of the invention, a computer-readable storage medium is provided, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal according to the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the terminal and the unit described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In several embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.

Claims (10)

1. A data processing method of a question-answering system oriented to the judicial examination field is characterized by comprising the following steps:
organizing the data corresponding to judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processing the text documents to form corresponding data sets;
vectorizing the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and storing the vectors and corresponding information in the corresponding knowledge bases;
obtaining a question input by a user, vectorizing the question with the text vector model to obtain a question vector, computing the similarity between the question vector and each vector in the vector set with cosine similarity, compiling the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary and adding it to the custom dictionary of the jieba word segmenter, computing the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplying the similarity and the relevance to obtain the comprehensive relevance between the question and a document;
comparing the comprehensive relevance with a set threshold, recalling a document if its comprehensive relevance is greater than the threshold, forming the recalled documents into a recall document set, and obtaining a recall content set by applying different recall strategies to the different kinds of content in the recalled documents;
integrating the recall content set and the prompt words to form the input content of a large language model, feeding the input content to the large language model, and having the large language model generate the corresponding answer from the provided input content and the question.
2. The method of claim 1, wherein the specific method for obtaining the recall content set by applying different recall strategies according to the content of the recalled documents comprises:
arranging the recall document set in descending order of comprehensive relevance;
if a teaching material document exists in the recall document set, acquiring the id of the teaching material document with the highest comprehensive relevance score among the recalled documents, and combining the document with that id with its adjacent teaching material documents to obtain recall content corresponding to the input question;
if an objective question document or a subjective question document exists in the recall document set, acquiring the corresponding answers and analyses according to the question numbers corresponding to the stems, to obtain recall content corresponding to the input question;
and arranging the obtained recall content set in descending order of comprehensive relevance score.
3. The method of claim 2, wherein the specific method of integrating the recall content set with the prompt words to form the input content of the large language model comprises:
taking the first K documents in the recall content set, where K is an integer chosen so that the total length of the K documents does not exceed the context-processing limit of the large language model, concatenating the first K documents to serve as the context, and splicing the context and the question into the prompt words to form the input content of the large language model.
4. The method of claim 1, wherein the text vector model is text-embedding-ada-002.
5. A question-answering system oriented to the judicial examination field, characterized by comprising a data set module, a knowledge base construction module, a search engine module, a content recall module and a large language model module;
the data set module organizes the data corresponding to the judicial examination teaching materials, subjective questions, objective questions and legal provisions into text documents, and processes the text documents to form corresponding data sets;
the knowledge base construction module vectorizes the text documents with a text vector model to obtain a vector set, the vector set comprising paragraph vectors, objective question stem vectors, joint vectors of subjective question stems and questions, and vectors of statute names and contents, and stores the vectors and corresponding information in the corresponding knowledge bases;
the search engine module obtains the question input by a user, vectorizes the question with the text vector model to obtain a question vector, computes the similarity between the question vector and each vector in the vector set with cosine similarity, compiles the legal-domain vocabulary in the teaching materials, objective questions, subjective questions and legal provisions into a legal dictionary added to the custom dictionary of the jieba word segmenter, computes the relevance between the question and the paragraphs of the data sets with the BM25 algorithm, and multiplies the similarity and the relevance to obtain the comprehensive relevance;
the content recall module compares the comprehensive relevance with a set threshold, recalls a document if its comprehensive relevance is greater than the threshold, forms the recalled documents into a recall document set, and obtains a recall content set by applying different recall strategies to the different kinds of content in the recalled documents;
the large language model module integrates the recall content set and the prompt words to form the input content of the large language model, feeds the input content to the large language model, and has the large language model generate the corresponding answer from the provided input content and the question.
6. The system of claim 5, wherein the content recall module comprises a content set generation unit configured to arrange the recall document set in descending order of comprehensive relevance;
if a teaching material document exists in the recall document set, acquire the id of the teaching material document with the highest comprehensive relevance score among the recalled documents, and combine the document with that id with its adjacent teaching material documents to obtain recall content corresponding to the input question;
if an objective question document or a subjective question document exists in the recall document set, acquire the corresponding answers and analyses according to the question numbers corresponding to the stems, to obtain recall content corresponding to the input question;
and arrange the obtained recall content set in descending order of comprehensive relevance score.
7. The system of claim 6, wherein the large language model module comprises an integration unit configured to take the first K documents in the recall content set, where K is an integer chosen so that the total length of the K documents does not exceed the context-processing limit of the large language model, concatenate the first K documents to serve as the context, and splice the context and the question into the prompt words to form the input content of the large language model.
8. The system of claim 5, wherein the text vector model is text-embedding-ada-002.
9. A smart terminal comprising a processor, an input device, an output device and a memory, the processor being connected to the input device, the output device and the memory, respectively, the memory being for storing a computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method of any of claims 1-4.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-4.
CN202311128057.5A 2023-09-01 2023-09-01 Question-answering system oriented to judicial examination field, data processing method and terminal Pending CN117390146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311128057.5A CN117390146A (en) 2023-09-01 2023-09-01 Question-answering system oriented to judicial examination field, data processing method and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311128057.5A CN117390146A (en) 2023-09-01 2023-09-01 Question-answering system oriented to judicial examination field, data processing method and terminal

Publications (1)

Publication Number Publication Date
CN117390146A true CN117390146A (en) 2024-01-12

Family

ID=89467340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311128057.5A Pending CN117390146A (en) 2023-09-01 2023-09-01 Question-answering system oriented to judicial examination field, data processing method and terminal

Country Status (1)

Country Link
CN (1) CN117390146A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination