CN117875304A - Corpus construction method, system and storage medium for subway field - Google Patents
- Publication number
- CN117875304A (application number CN202410043265.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- subway
- voice
- model
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a corpus construction method, system and storage medium for the subway field, belonging to the field of data processing. The method comprises the following steps: collecting voice data in the subway field; sequentially cleaning, labeling and cutting the subway-field voice data as preprocessing; performing text conversion on the preprocessed subway-field voice data, and adding the converted text data to a proprietary voice library based on AISHELL to form an AISHELL data set; training a speech recognition model with the data in the AISHELL data set, and recognizing the audio files integrated in the AISHELL data set with the trained speech recognition model to generate text data; constructing a business keyword matching model and a label classification model; and structuring the converted text data and classifying it according to the business keyword matching model and the label classification model to obtain a corpus for the subway field. By combining the open-source AISHELL release with the proprietary voice library, the invention forms a dedicated voice library for the subway emergency field, with a recognition rate above 99%.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a corpus construction method, a corpus construction system and a storage medium for the subway field.
Background
With the rapid development of artificial-intelligence large language models, every industry considering digital transformation is applying large language models to vertical-domain services. Unlike general-purpose applications, vertical-domain applications carry industry-specific attributes and demand high accuracy, so building the industry corpus is the most fundamental problem to solve.
Organizing an industry corpus requires operations such as business subdivision, multi-modal data conversion and timely updating before it can support vertical applications. The invention classifies daily voice service interactions in the subway field by business category and converts them into corpus text; to address accuracy, the corpus is converted to CSV format and supplied to a large language model for training. The whole process requires little manual involvement, which greatly saves manpower and reduces the cost of applying large language models in vertical domains. Daily business in an industry has the attributes of a data asset, and the invention can effectively improve the efficiency of data-asset conversion, reducing cost and improving efficiency for the industry.
At present, when various corpora are constructed, the speech-to-text service of the Baidu speech capability engine (SPEECH) and iFLYTEK's iFLYTEK Hear (Xunfei Tingjian) are commonly used to convert speech data and thereby build the related database.
(1) The core technologies of the speech-to-text service of the Baidu speech capability engine (SPEECH) are as follows:
(1) Acoustic model: the acoustic model is an important component of speech recognition and maps sound signals to audio features. Baidu uses deep neural networks (DNN), convolutional neural networks (CNN) and related techniques to build acoustic models that extract features and patterns from sound.
(2) Language model: the language model performs text inference and error correction based on the context of the speech signal and on language knowledge. Baidu builds its language model from large-scale language data and neural network techniques to improve the accuracy and fluency of speech-to-text conversion.
(3) End-to-end model: Baidu has developed an end-to-end speech recognition model that integrates the acoustic model and the language model into a unified model and converts raw speech directly to text. The end-to-end model reduces the intermediate steps and error propagation of the traditional speech recognition pipeline and improves recognition performance.
(4) Data augmentation: to improve the robustness and generalization ability of the model, Baidu uses data augmentation to expand the training data. This includes adding noise, varying speech rate and pitch, and simulating different recording environments, so that the model adapts better to different speech inputs.
(5) Transfer learning: Baidu uses transfer learning to fine-tune models pre-trained on large-scale general data so that they meet the speech-to-text requirements of a particular domain or task. Such transfer learning speeds up model training and improves model performance in a particular domain.
(2) The core technologies of the speech-to-text function of iFLYTEK Hear are as follows:
(1) Acoustic model: iFLYTEK uses deep learning to build acoustic models that convert sound signals into speech features. These models include deep neural networks (DNN), convolutional neural networks (CNN) and long short-term memory networks (LSTM), which extract speech features and perform pattern matching effectively to achieve accurate speech recognition.
(2) Language model: iFLYTEK uses large-scale corpora and language modeling techniques to model grammar and semantics during speech-to-text conversion. The language model captures the statistical regularities and contextual information of the language and improves the accuracy and fluency of speech recognition.
(3) Speech preprocessing: iFLYTEK has developed a series of speech preprocessing techniques for noise reduction, acoustic feature enhancement, speech signal enhancement and the like. These techniques improve the robustness of a speech recognition system to environmental factors such as noise, pitch variation and signal distortion.
(4) End-to-end model: iFLYTEK has also developed an end-to-end speech recognition model that integrates the acoustic model, language model, pronunciation model and so on into a unified model and converts raw speech directly to text. The end-to-end model simplifies the pipeline of the traditional speech recognition system and improves recognition performance and efficiency.
(5) Transfer learning: iFLYTEK uses transfer learning to fine-tune models pre-trained on large-scale general data so that they meet the speech-to-text requirements of a specific domain or task. Such transfer learning speeds up model training and improves model performance in specific domains.
The prior art lacks depth in specialized fields. From a technical point of view, both products serve general industry and lack a knowledge base for the subway field; their comprehensive recognition rate has been verified to be below 95%. The main reason is that they have not learned from a subway-specific knowledge base: the recognition rate for everyday business terms is high, but the recognition rate for specialized terms is low.
Disclosure of Invention
In order to solve the above problems, the invention provides a corpus construction method, system and storage medium for the subway field.
In order to achieve the above object, the present invention provides the following technical solutions:
A corpus construction method for the subway field comprises the following steps:
collecting voice data in the subway field;
sequentially cleaning, labeling and cutting the voice data in the subway field as preprocessing;
performing text conversion on the preprocessed subway-field voice data, and adding the converted text data to a proprietary voice library based on AISHELL to form an AISHELL data set;
training a speech recognition model with the data in the AISHELL data set, and recognizing the audio files integrated in the AISHELL data set with the trained speech recognition model to generate text data;
constructing a business keyword matching model and a label classification model for the subway field;
structuring the text data generated by the speech recognition model, and classifying the structured data according to the business keyword matching model and the label classification model to obtain a corpus for the subway field.
Preferably, collecting the voice data in the subway field specifically includes: recording at the emergency command center (NCC), the operation control center (OCC), subway stations and carriages to collect subway-field voice data, the collected voice data covering different scenes, speech rates, intonations and noise environments.
Preferably, the sequential cleaning, labeling and cutting of the voice data in the subway field specifically includes:
cleaning: removing noise, extraneous sounds and irrelevant voice fragments from the voice data;
labeling: transcribing the cleaned voice data into text so that each utterance corresponds to its text content, the voice data being recognized automatically with the labeling tool Kaldi and then post-processed and corrected to produce an initial labeling result;
cutting: cutting the cleaned and labeled voice data into audio fragments of a set size, and organizing and storing the fragments according to a specific directory structure.
Preferably, the text conversion of the preprocessed subway-field voice data specifically includes:
data preparation: organizing the audio files of the subway-field voice library according to the format of the AISHELL data set, and creating corresponding text annotation files, each audio file corresponding to one text annotation and each annotation file named to match its audio file;
data division: dividing the organized proprietary subway-field voice library data into a training set, a validation set and a test set, the audio files in each set matching their corresponding text annotation files;
training the speech recognition model with the data in the AISHELL data set specifically includes:
training the speech recognition model with the AISHELL data set and the corresponding text annotation files, using the TensorFlow framework together with TensorFlow's built-in model library.
Preferably, recognizing the audio files integrated in the AISHELL data set with the trained speech recognition model includes recognizing the audio files one by one or processing a plurality of audio files in batches.
Preferably, constructing the business keyword matching model and the label classification model specifically includes:
constructing the business keyword matching model, which comprises:
data preparation: for each specialty in the subway field, sample texts containing keywords are collected as training data, each sample text carrying at least one label indicating whether the keywords are contained;
feature extraction: features are extracted from the text, and the text is represented as a numerical vector using the bag-of-words model, TF-IDF and word embedding methods;
model training: the business keyword matching model is trained with a naive Bayes classifier;
constructing the label classification model, which comprises: setting the number of label levels for each class according to the depth of business reached by each specialty in the subway field.
Preferably, structuring the converted text data and classifying it according to the business keyword matching model and the label classification model to obtain the corpus for the subway field specifically includes:
text preprocessing: removing special characters, segmenting words and removing stop words from the labeled text data;
key information identification: identifying key information in the text by keyword matching, regular expressions and similar means;
entity identification: identifying specific entity information such as subway lines, stations and dates in the text;
relationship extraction: extracting relationship information in the subway field according to the semantics and context of the text;
knowledge base establishment: storing the extracted subway-field knowledge in structured form.
Preferably, the method further comprises splitting the data of the subway-field corpus to form a business question-and-answer (QA) knowledge base, which specifically includes:
defining question types: the question types of the subway field are defined by specialty;
organizing the corpus by question type: the labels in the knowledge base are grouped by question type, each question type corresponding to one label set, and the labels are classified and imported into the knowledge base system to generate the corpus;
generating questions: for each question type, a series of related questions is generated from the information in the knowledge base.
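The question-generation step can be illustrated with a minimal template-based sketch. The source does not say how questions are generated, so the templates, the entry fields (`type`, `event`, `location`, `measure`) and the sample entry below are all hypothetical.

```python
# Hypothetical templates: one question pattern per question type (specialty).
QUESTION_TEMPLATES = {
    "power supply": "What should be done when {event} occurs at {location}?",
    "passenger transport": "How should staff respond to {event} at {location}?",
}


def generate_questions(entries, templates=QUESTION_TEMPLATES):
    """Turn knowledge-base entries into question/answer pairs grouped by question type."""
    qa_by_type = {}
    for entry in entries:
        template = templates.get(entry["type"])
        if template is None:
            continue  # no template defined for this question type
        question = template.format(event=entry["event"], location=entry["location"])
        qa_by_type.setdefault(entry["type"], []).append(
            {"question": question, "answer": entry["measure"]}
        )
    return qa_by_type


# Hypothetical knowledge-base entry, for illustration only.
entries = [
    {
        "type": "power supply",
        "event": "a power supply fault",
        "location": "Line 1",
        "measure": "Dispatch emergency repair and close the affected section.",
    },
]
qa = generate_questions(entries)
```

Grouping the output by question type mirrors the one-label-set-per-question-type organization described above.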
The invention also provides a corpus construction system for the subway field, comprising:
a data collection module for collecting voice data in the subway field;
a data preprocessing module for sequentially cleaning, labeling and cutting the voice data in the subway field;
a data set construction module for performing text conversion on the preprocessed subway-field voice data and adding the converted text data to a proprietary voice library based on AISHELL to form an AISHELL data set;
a model training module for training the speech recognition model with the data in the AISHELL data set and recognizing the audio files integrated in the AISHELL data set with the trained speech recognition model to generate text data;
a model building module for constructing a business keyword matching model and a label classification model for the subway field;
a knowledge base construction module for structuring the text data generated by the speech recognition model and classifying the structured data according to the business keyword matching model and the label classification model to obtain a corpus for the subway field.
The invention also provides a computer readable storage medium, on which a computer program is stored, which when loaded by a processor is capable of executing the steps of the corpus construction method for subway fields.
The corpus construction method, system and storage medium for the subway field provided by the invention have the following beneficial effects:
The method first preprocesses the collected subway-field voice data and builds an AISHELL data set in combination with the AISHELL-based proprietary voice library, so that the data set has the depth required by the specialized field. The speech recognition model is trained with the data in the AISHELL data set, which improves its recognition rate for professional terms. The trained model recognizes the audio files integrated in the AISHELL data set to generate text data, and the data is classified with the constructed business keyword matching model and label classification model for the subway field, which effectively removes overlapping and duplicated data and finally yields the subway-field corpus.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some of the embodiments of the present invention and other drawings may be made by those skilled in the art without the exercise of inventive faculty.
Fig. 1 is a flowchart of a corpus construction method for subway field in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and the embodiments, so that those skilled in the art can better understand the technical scheme of the present invention and can implement the same. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Example 1
The corpus referred to in the invention can be understood as plain-text material for large language model training; the knowledge base refers to the business types to be expressed in the corpus and the corresponding response measures.
Specifically, the invention provides a corpus construction method for the subway field which, as shown in Fig. 1, comprises the following steps:
Step 1: Collect voice data in the subway field. The main purpose is to build the subway field's own voice library and so solve the recognition-accuracy problem in the subway field. Specifically:
Voice data is collected by recording at places such as the emergency command center (NCC), the operation control center (OCC), subway stations and carriages, ensuring that the collected voice data covers different scenes, speech rates, intonations, noise environments and so on.
Step 2: Preprocess the collected data, which specifically includes:
Data cleaning and labeling: the collected voice data is cleaned and labeled. Cleaning removes noise, extraneous sounds and irrelevant voice fragments. Labeling transcribes the voice data into text so that each utterance corresponds to its text content; the voice data is first recognized automatically with the labeling tool Kaldi, and the output is then post-processed and corrected to produce an initial labeling result.
Data cutting: the cleaned and labeled voice data is cut into small audio fragments, which are organized and stored according to a specific directory structure. The fragments can be organized by subway station, train line, voice instruction type and so on, which facilitates later retrieval and use.
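The cutting step can be sketched with Python's standard `wave` module. This is an illustration under stated assumptions, not the patent's implementation: the recordings are assumed to be uncompressed WAV files, the 5-second segment length is arbitrary, and the `line/station/command_type` directory layout and the helper names are hypothetical.

```python
import wave
from pathlib import Path


def segment_dir(root: Path, line: str, station: str, command_type: str) -> Path:
    """Hypothetical layout: <root>/<line>/<station>/<command_type>/."""
    return root / line / station / command_type


def split_wav(src: Path, out_dir: Path, segment_seconds: float = 5.0) -> list:
    """Cut one WAV file into consecutive segments of at most segment_seconds each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_segment = int(params.framerate * segment_seconds)
        index = 0
        while True:
            frames = reader.readframes(frames_per_segment)
            if not frames:
                break  # end of the recording
            target = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(target), "wb") as writer:
                # Copy channels, sample width and rate; the frame count
                # in the header is corrected when the writer is closed.
                writer.setparams(params)
                writer.writeframes(frames)
            written.append(target)
            index += 1
    return written
```

Fixed-length cutting is the simplest choice; cutting at silences would avoid splitting words but needs an energy or voice-activity detector, which is outside this sketch.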
Step 3: Perform text conversion on the preprocessed subway-field voice data, and add the converted text data to the proprietary voice library based on AISHELL to form an AISHELL data set. Specifically, speech-to-text recognition is based on AISHELL with the proprietary voice library added to form the AISHELL data set, so that, building on AISHELL, the data set adapts better to the client's regional environment. The process is as follows:
Data preparation: the audio files of the proprietary subway-field voice library are arranged according to the format of the AISHELL data set (including placing the audio files in the corresponding folders and following its naming convention), and corresponding text annotation files are created, each audio file corresponding to one text annotation and each annotation file named to match its audio file.
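The one-annotation-per-audio-file requirement lends itself to a quick consistency check. The sketch below assumes, purely for illustration, that each annotation is a `.txt` file with the same stem stored next to its `.wav` file; the source does not specify the extension or location, and the function name is hypothetical.

```python
from pathlib import Path


def pair_audio_with_annotations(root: Path):
    """Return (pairs, unpaired): audio files with a non-empty same-named annotation, and those without."""
    pairs, unpaired = [], []
    for wav in sorted(root.rglob("*.wav")):
        annotation = wav.with_suffix(".txt")  # assumed naming rule: same stem, .txt
        if annotation.is_file() and annotation.read_text(encoding="utf-8").strip():
            pairs.append((wav, annotation))
        else:
            unpaired.append(wav)
    return pairs, unpaired
```

Running such a check before training catches missing or empty transcriptions early, when they are cheap to fix.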
Data division: the organized proprietary subway-field voice library data is divided into a training set, a validation set and a test set, used for training, validation and testing respectively, ensuring that the audio files in each set match their corresponding text annotation files.
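A minimal sketch of the division step follows. The 80/10/10 ratio, the fixed seed and the function name are assumptions; the source does not state the proportions.

```python
import random


def split_dataset(utterance_ids, train=0.8, dev=0.1, seed=0):
    """Shuffle utterance IDs reproducibly and split them into train/dev/test."""
    ids = sorted(utterance_ids)        # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)   # seeded shuffle keeps the split reproducible
    n_train = int(len(ids) * train)
    n_dev = int(len(ids) * dev)
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }


splits = split_dataset([f"utt{i:03d}" for i in range(100)])
```

Splitting by utterance ID keeps each audio file together with its annotation. In practice a split by speaker or by recording session is often preferred so that the test set contains voices the model has not heard.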
Step 4: Train a speech recognition model with the data in the AISHELL data set, and recognize the audio files integrated in the AISHELL data set with the trained model to generate text data. Specifically, the speech recognition model (which may be a recognition model from a company such as iFLYTEK or Baidu, or a self-built hybrid DNN-HMM model) is trained with the AISHELL data set (including the proprietary voice library data) and the corresponding text annotation files, using the TensorFlow framework together with TensorFlow's built-in model library. The step specifically includes:
Recognition: the trained speech recognition model can recognize audio files one by one or process multiple audio files in batches.
Post-processing and evaluation: the recognition results are post-processed (error correction) to further improve their accuracy and readability. The recognition results are then evaluated by comparing them with the original text annotations and calculating indicators such as accuracy and recall to assess the model's performance.
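One common way to compare recognition output with the reference annotations is the character error rate (CER), computed from the edit distance between the two strings. The source names accuracy and recall rather than CER, so the sketch below is a swapped-in illustration of the comparison step, with hypothetical function names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(references, hypotheses) -> float:
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total if total else 0.0
```

Character-level scoring suits Chinese transcripts, where word boundaries are not marked; one minus the CER is sometimes reported as a character accuracy.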
Step 5: constructing a business keyword matching model and a label classification model in the subway field, wherein a section of business description possibly contains a plurality of labels, and the triggering conforming to the labels needs the business keyword matching model to be completed, for example: the passenger flow of the northwest station is large, the A port is congested, passengers are injured, and the station is required to be coordinated. The text triggers the labels of address, business classification, job and article class 4, and the construction process is as follows:
(1) Keyword model
Data preparation: specialists in each subway profession collect sample texts containing keywords as training data; each sample text carries at least one label indicating whether the keywords are present.
Feature extraction: features are extracted from the text, which is represented as a numerical vector using methods such as the bag-of-words model, TF-IDF, and word embeddings.
Model training: the business keyword matching model is trained with a naive Bayes classifier.
Model evaluation: the performance of the model is evaluated using accuracy, precision, recall, and F1 score.
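The keyword model steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a multinomial naive Bayes classifier over bag-of-words counts with Laplace smoothing, whitespace tokenization (Chinese text would need a word segmenter), and hypothetical training sentences and label names. It is not the implementation of the invention.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesKeywordModel:
    """Multinomial naive Bayes over bag-of-words counts (Laplace smoothing)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log(
                        (self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical training samples, for illustration only.
TRAIN_TEXTS = [
    "power supply fault on line one",
    "traction power outage at substation",
    "passenger injured at exit a",
    "large passenger flow at station",
]
TRAIN_LABELS = ["power", "power", "passenger", "passenger"]
model = NaiveBayesKeywordModel().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

Calling `model.predict("power fault at substation")` selects the label whose training texts share the most vocabulary with the input, and `precision_recall_f1` can then be applied to predictions on a held-out set.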
(2) Label classification model. The subway field is divided into 8 professions: building, power supply, communication and signal, vehicle, passenger transport, electromechanical, passenger service, and scheduling. Labels are accordingly divided into 8 classes, one per profession, and the number of label levels in each class is set according to the depth of that profession's business. The scheme also supports composite labels for a single piece of data. For example: "A power supply fault on a section of Line 1 needs emergency repair, and line scheduling closes operation in the affected area." This piece of information involves four label classes: power supply, communication and signal, passenger transport, and scheduling.
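The idea of a composite label can be illustrated with a small multi-label matcher. This is only a sketch: the keyword lists, the example sentence, and the function name `match_labels` are hypothetical, and a real system would use the trained models rather than fixed substring lists.

```python
# Hypothetical keyword lists for the 8 professions (illustrative only).
PROFESSION_KEYWORDS = {
    "building": ["tunnel", "track bed"],
    "power supply": ["power supply", "substation"],
    "communication and signal": ["signal", "communication"],
    "vehicle": ["rolling stock", "carriage"],
    "passenger transport": ["passenger flow", "passengers"],
    "electromechanical": ["escalator", "ventilation"],
    "passenger service": ["ticket", "customer service"],
    "scheduling": ["dispatch", "scheduling"],
}


def match_labels(text: str) -> list[str]:
    """Return every profession label whose keywords occur in the text."""
    lowered = text.lower()
    return [
        label
        for label, keywords in PROFESSION_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


example = (
    "Power supply fault on a section of Line 1 needs urgent repair; "
    "signal control is degraded, passengers are being diverted, "
    "and dispatch closes the affected section."
)
labels = match_labels(example)
```

Returning a list rather than a single label is the design choice that lets one record carry several profession classes at once.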
Step 6: Structure the text data generated by the speech recognition model, and classify the structured data according to the business keyword matching model and the label classification model to obtain the subway-field corpus. This comprises:
Text preprocessing: remove special characters, segment words, and remove stop words from the labeled text data.
Key information identification: identify key information in the text by keyword matching, regular expressions, and similar means, for example extracting subway line names, station names, operation adjustment information, and construction information.
Entity recognition: use named entity recognition (NER) to identify specific entities in the text, such as subway lines, stations, and dates.
Relation extraction: extract relations in the subway field from the semantics and context of the text, such as the association between subway lines and stations, or between operation adjustments and specific dates.
Knowledge base construction: store the extracted subway-field knowledge in structured form, using a database for storage and management, with support for extension to a graph database. Storage is partitioned by the 8 subway professions, matching the classification of the label system.
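The regular-expression route to key information can be sketched as follows. The patterns assume English-style names such as "Line 2" and "North Gate Station" and ISO dates, all of which are hypothetical; the co-occurrence rule used for relations is a deliberately naive stand-in for semantic relation extraction.

```python
import re

# Hypothetical patterns for illustration; real line and station naming differs.
LINE_PATTERN = re.compile(r"\bLine \d+\b")
STATION_PATTERN = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*) Station\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str) -> dict[str, list[str]]:
    """Collect line names, station names, and dates found in the text."""
    return {
        "lines": LINE_PATTERN.findall(text),
        "stations": STATION_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


def extract_relations(text: str) -> list[tuple[str, str, str]]:
    """Naive assumption: a line and a station in the same sentence are related."""
    entities = extract_entities(text)
    return [
        (line, "has_station", station)
        for line in entities["lines"]
        for station in entities["stations"]
    ]


sentence = "On 2024-01-11, Line 2 service was adjusted at North Gate Station."
entities = extract_entities(sentence)
relations = extract_relations(sentence)
```

The resulting triples have the subject, predicate, object shape that a relational table or a graph database can store directly.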
Step 7: Audit.
Professional, operations, and management staff from the 8 subway professions are invited to manually verify the knowledge base class by class. If problems are found, the work is returned to the "business keyword matching model + label classification model" stage for model adjustment, gradually improving the accuracy of the knowledge base.
Step 8: Split the data in the constructed subway-field corpus into question-answer (QA) form. The business question-answer knowledge base (based on subway normative documents) mainly covers routine and emergency scenarios.
Define question types: question types are defined according to the 8 professions, such as operation adjustment, ticket information, transfer guidance, flood emergency, and fire handling specifications.
Organize the corpus by question type: the labels in the knowledge base are grouped by question type, with each question type corresponding to one label set. If a normative document is encountered, it must first be imported into the knowledge base system for label classification before the corpus is generated.
Generate questions: for each question type, a series of related questions is generated from the information in the knowledge base. Because the large language model provides capabilities such as vector expansion and natural-language semantic understanding, it is sufficient to classify the question types of the corpus in reasonable detail; unlike conventional intelligent question-answering approaches, the full set of possible questions does not need to be collected.
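One simple way to generate questions per question type is template filling, sketched below. The templates, the field names (`question_type`, `subject`, `answer`, `labels`), and the sample entry are hypothetical; the source does not specify how questions are produced.

```python
# Hypothetical question templates keyed by question type (illustrative only).
QUESTION_TEMPLATES = {
    "fire handling": [
        "What is the fire handling procedure for {subject}?",
        "How should staff respond to a fire at {subject}?",
    ],
    "transfer guidance": [
        "How do passengers transfer at {subject}?",
    ],
}


def generate_qa_pairs(entries: list[dict]) -> list[dict]:
    """Expand each knowledge-base entry into one QA pair per matching template."""
    pairs = []
    for entry in entries:
        for template in QUESTION_TEMPLATES.get(entry["question_type"], []):
            pairs.append({
                "question": template.format(subject=entry["subject"]),
                "answer": entry["answer"],
                "labels": entry["labels"],
            })
    return pairs


knowledge_base = [{
    "question_type": "fire handling",
    "subject": "a station platform",
    "answer": "Evacuate passengers, then notify the OCC.",
    "labels": ["passenger transport", "scheduling"],
}]
qa_pairs = generate_qa_pairs(knowledge_base)
```

Several phrasings map to the same answer here, which reflects the point that question types need only be classified carefully rather than enumerated exhaustively.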
Step 9: Convert the subway-field corpus to CSV for use by the large model.
Verification with the large language model ChatGLM showed that the precise corpus in CSV form yields the highest accuracy, so the corpus is converted to CSV in this final step.
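A CSV conversion of QA pairs might look like the sketch below, which uses Python's standard `csv` module. The column names (`question`, `answer`, `labels`) and the semicolon used to join labels are assumptions; the source does not give the CSV schema.

```python
import csv
import io


def qa_pairs_to_csv(pairs: list[dict]) -> str:
    """Serialize QA pairs to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer", "labels"])
    writer.writeheader()
    for pair in pairs:
        writer.writerow({
            "question": pair["question"],
            "answer": pair["answer"],
            # Assumed convention: multiple labels joined with a semicolon.
            "labels": ";".join(pair["labels"]),
        })
    return buffer.getvalue()


sample_pairs = [{
    "question": "What is the fire handling procedure for a station platform?",
    "answer": "Evacuate passengers, then notify the OCC.",
    "labels": ["passenger transport", "scheduling"],
}]
csv_text = qa_pairs_to_csv(sample_pairs)
```

Using `csv.DictWriter` rather than manual string joining means that answers containing commas or quotes are escaped correctly.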
Regarding steps 5, 6 and 7: constructing the subway-field corpus by label classification, through the combination of the "business keyword matching model + label classification model" and "knowledge extraction", improves the accuracy of the knowledge base in the specialized subway field. In specialized-field applications, accuracy is the primary requirement. If only the matching and label classification models are used, or only knowledge extraction is performed, accuracy degrades as the data volume grows. Combining the two effectively removes overlapping and duplicate data, cleans the data by business label classification, and provides accurate support for later business applications. Table 1 compares the effect of model use:
Table 1: Results of effect comparison
Regarding the manual auditing part in step 5: specialists from every profession are required to calibrate the knowledge. If coverage is low and the repetition rate is high, the manual workload is heavy. Because manual intervention takes place only after steps 3 and 4 have been combined, the workload is significantly reduced and the efficiency of the overall scheme is improved.
Regarding step 8: the data is split to form a corpus in question-answer (QA) form. Corpora for general-purpose large language models are typically imported as segmented text, but a large language model for the subway vertical field requires high business accuracy, and the traditional segmentation method struggles to meet the business requirements of intelligent interaction, so another corpus form must be chosen. Data verification shows that the question-answer (QA) corpus achieves results that satisfy customers.
Based on the same inventive concept, the invention also provides a corpus construction system for the subway field, which comprises a data collection module, a data preprocessing module, a data set construction module, a model training module, a model construction module and a knowledge base construction module.
Specifically, the data preprocessing module is used for sequentially performing cleaning, labeling, and cutting preprocessing on the voice data in the subway field; the data set construction module is used for performing text conversion on the preprocessed subway-field voice data, and adding the converted text data to the in-house voice library built on AISHELL to form the AISHELL data set; the model training module is used for training a speech recognition model with the data in the AISHELL data set, and recognizing the integrated audio files of the AISHELL data set with the trained speech recognition model to generate text data; the model building module is used for building a business keyword matching model and a label classification model for the subway field; the knowledge base construction module is used for structuring the text data generated by the speech recognition model and classifying the structured data according to the business keyword matching model and the label classification model to obtain the subway-field corpus.
Each module in the corpus construction system for the subway field may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to each module.
Further, the present invention also provides a non-transitory computer-readable storage medium containing instructions, on which a computer program is stored, for example a memory containing instructions executable by a processor of a computer device to perform the above method. The non-transitory computer-readable storage medium may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device. When the computer program is executed by the processor, the steps in the embodiments of the corpus construction method for the subway field are implemented. For the specific implementation, refer to the method embodiments, which are not repeated here.
The invention is distinguished from the prior art mainly as follows:
(1) Building on the open-source version of AISHELL (a Chinese speech recognition data set containing Chinese speech data from different speakers, part of which is open to developers for Chinese speech recognition development), the invention distills the customer's key business points and adds an in-house voice library, forming a dedicated voice library for the subway emergency field with a recognition rate above 99%. The specific steps are as follows:
Download: download the AISHELL open-source data set from the AISHELL official website (http://www.aishelltech.com/). The data set contains recorded Chinese speech data and the corresponding text transcriptions.
Data collection: collect audio data in the subway emergency field, comprising voice samples from different speakers and different environments.
Data preprocessing: remove silent segments, reduce noise, and normalize the audio to improve speech quality and reduce the impact of noise on subsequent processing.
Feature extraction: extract features from the preprocessed voice data, including Mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC), and filter bank features (FBANK). These features represent the spectral characteristics of the speech and can be used for subsequent speech recognition tasks.
Labeling and tagging: add correct labels and tags to the voice data using the text transcriptions provided in the AISHELL data set.
Model training: train a speech recognition model on the labeled and tagged voice data set. Training aims to learn the association between the features of the voice data and the corresponding labels, using hidden Markov models (HMM), deep neural networks (DNN), and long short-term memory networks (LSTM).
Evaluation and optimization: evaluate the trained speech recognition model on an evaluation data set, using indicators that include recognition accuracy and false recognition rate, and adjust and optimize the model according to the results to improve recognition performance.
Deployment: deploy the trained speech recognition model in a practical application to perform speech-to-text tasks. When the voice library is used, the input voice data is preprocessed and its features are extracted before recognition with the trained model.
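Two of the preprocessing operations named above, silence removal and audio normalization, can be sketched on a plain list of samples. This is a minimal illustration assuming samples are floats in the range -1.0 to 1.0; the threshold and target peak values are hypothetical, and production systems would normally use a dedicated audio library.

```python
def trim_silence(samples: list[float], threshold: float = 0.02) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []
    return list(samples[voiced[0]: voiced[-1] + 1])


def normalize_peak(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale the signal so that its largest magnitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale in an all-zero signal
    gain = target_peak / peak
    return [s * gain for s in samples]


raw = [0.0, 0.01, 0.5, -0.25, 0.0]
cleaned = normalize_peak(trim_silence(raw), target_peak=1.0)
```

Trimming before normalizing keeps low-level noise at the edges from being amplified along with the speech.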
(2) For intelligent assistance, most prior art relies on a label system for display, whose intelligent-assistance effect and accuracy are not ideal. The invention instead supplies a precise industry corpus to the ChatGLM large model. Supplying a precise corpus to the large language model builds further on the label system, as follows:
Label system construction: study the subway emergency field in depth to understand the relevant business characteristics, key concepts, and content structure; determine the hierarchical structure and classification scheme of the labels, using a 3-level hierarchy divided into 8 classes by profession; define and standardize each label and fix the label format to keep labels uniform; establish an association and relationship model among the labels in the label system, associating labels with the 8 professional classes according to the correlations, hierarchy, and logical relationships of the subway emergency field; during application, optimize based on actual use and feedback, adjusting, extending, and optimizing the labels through training combined with subway emergency services so that performance continuously improves.
The precise corpus produced after speech-to-text conversion in the subway emergency field is constructed as follows:
Confirm the relevant content of the subway emergency field, and collect the related specification requirements, technical details, and corresponding voice data (including but not limited to subway emergency specifications, operating line maps, station information, emergency materials information, passenger question-and-answer records, and emergency training voice data).
After the collected data is converted to text, it is cleaned and preprocessed to ensure data quality and accuracy. This includes removing duplicate data, extraction, denoising, and correcting erroneous information.
Classify the corpus according to the label system, for example by subway station name, line name, passenger flow, and weather.
Verify and review the data after labeling and classification to ensure that the labels are accurate and consistent. Proper nouns are checked for correctness mainly through manual review and are corrected and adjusted where necessary (the review workload is greatly reduced by the earlier collection of domain-specific audio data).
Data expansion and enhancement: expand the scale and diversity of the corpus through data augmentation techniques. For example, English corpus samples are generated using synthesis techniques (via a translation mechanism).
Text file data management and storage: establish a Lustre distributed file system to provide high-performance file storage and access, leave room for later expansion to high-performance computing (HPC), and support large-scale parallel data access and processing, so that corpus data is managed and stored effectively.
Continuous updating and optimization: the subway-field corpus must be continuously updated to adapt to new lines and stations, adjustments and changes to emergency mechanisms, and so on. New data is collected and updated regularly, and the corpus is continuously optimized and refined according to user feedback and requirements.
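The cleaning step in the process above (removing duplicate data and normalizing text) can be sketched as follows. The normalization choices, Unicode NFKC folding, whitespace collapsing, and case-insensitive duplicate detection, are assumptions for illustration; the source does not state which rules are applied.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize full-width/half-width forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, comparing cleaned, case-folded text."""
    seen = set()
    unique = []
    for record in records:
        cleaned = clean_text(record)
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


corpus = deduplicate(["Line 1  closed ", "line 1 closed", "", "Exit A congested"])
```

Preserving first-seen order keeps the cleaned corpus aligned with the order in which the source material was collected.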
The invention converts daily voice business interactions in the subway field into business categories and corpus entries. To address accuracy, the corpus is converted to CSV format and supplied to a large language model for training. The whole process requires little manual participation, which greatly saves manpower and reduces the cost of applying large language models in vertical industry fields. Daily business in an industry field has the attributes of a data asset, and the invention can effectively improve the efficiency of data asset conversion, reducing cost and increasing efficiency for the industry. By adding an in-house voice library to the open-source version of AISHELL, the invention forms a dedicated voice library for the subway emergency field with a recognition rate above 99%.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that the above-described embodiments will enable those skilled in the art to understand the invention more fully, but do not limit it in any way. Thus, although the present invention has been described in detail with reference to this specification and the examples, those skilled in the art should understand that the invention may be modified or equivalently substituted; all technical schemes and improvements that do not depart from the spirit and scope of the invention are covered by the protection scope of the invention. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (10)
1. A corpus construction method for the subway field, characterized by comprising the following steps:
collecting voice data in the subway field;
sequentially performing cleaning, labeling, and cutting preprocessing on the voice data in the subway field;
performing text conversion on the preprocessed subway-field voice data, and adding the converted text data to the in-house voice library built on AISHELL to form an AISHELL data set;
training a speech recognition model with the data in the AISHELL data set, and recognizing the integrated audio files of the AISHELL data set with the trained speech recognition model to generate text data;
constructing a business keyword matching model and a label classification model for the subway field;
structuring the text data generated by the speech recognition model, and classifying the structured data according to the business keyword matching model and the label classification model to obtain a corpus for the subway field.
2. The method for constructing a corpus for the subway field according to claim 1, wherein the collecting of voice data in the subway field specifically comprises: collecting subway-field voice data by recording at the emergency command center (NCC), the operation control center (OCC), subway stations, and train carriages, wherein the collected voice data covers different scenes, speech rates, intonations, and noise environments.
3. The method for constructing a corpus for the subway field according to claim 1, wherein the sequential cleaning, labeling, and cutting preprocessing of the voice data in the subway field specifically comprises:
cleaning: removing noise, clutter, and irrelevant voice fragments from the voice data;
labeling: transcribing the cleaned voice data into text so that each utterance corresponds to its text content, using the labeling tool Kaldi to automatically recognize the voice data, and then performing post-processing and error correction to generate an initial labeling result;
cutting: cutting the cleaned and labeled voice data into audio fragments of a set size, and organizing and storing the audio fragments according to a specific directory structure.
4. The method for constructing a corpus for the subway field according to claim 1, wherein the text conversion of the preprocessed subway-field voice data specifically comprises:
preparing data: organizing the audio files of the subway-field voice library according to the format of the AISHELL data set, and creating corresponding text annotation files, wherein each audio file corresponds to one text annotation and the annotation files are named to match the audio files;
dividing data: dividing the organized in-house subway-field voice library data into a training set, a validation set, and a test set, wherein the audio files in each set are matched with the corresponding text annotation files;
the training of the speech recognition model with the data in the AISHELL data set specifically comprises:
training the speech recognition model using the AISHELL data set and the corresponding text annotation files, with the TensorFlow framework combined with TensorFlow's bundled model library.
5. The method of claim 4, wherein the step of using the trained speech recognition model to identify the audio files after the data set is integrated comprises identifying each audio file or processing a plurality of audio files in batch.
6. The method for constructing a corpus for the subway field according to claim 1, wherein the constructing of a business keyword matching model and a label classification model specifically comprises:
the construction of the business keyword matching model comprises the following steps:
data preparation: specialists in each subway profession collect sample texts containing keywords as training data, each sample text carrying at least one label indicating whether the keywords are present;
feature extraction: extracting features from the text and representing the text as a numerical vector using the bag-of-words model, TF-IDF, and word embedding methods;
model training: training the business keyword matching model with a naive Bayes classifier;
the construction of the label classification model comprises: setting the number of label levels for each class according to the business depth of each subway profession.
7. The method for constructing a corpus in the subway field according to claim 1, wherein the structuring the converted text data, classifying the text data according to a business keyword matching model and a label classification model, and obtaining the corpus in the subway field specifically comprises:
text preprocessing, namely performing special character removal, word segmentation and stop word removal operation on the tagged text data;
key information identification, namely identifying key information in a text in a manner of key word matching, regular expression and the like;
entity identification, namely identifying subway line, station and date specific entity information in a text;
relationship extraction, namely extracting relationship information in the subway field according to semantics and context in the text;
and establishing a knowledge base, and carrying out structural storage on the extracted knowledge in the subway field.
8. The method for constructing a corpus for the subway field according to claim 1, further comprising splitting the data of the subway-field corpus to form a business question-answer (QA) knowledge base, specifically:
defining question types: defining the question types of the subway field according to profession;
organizing the corpus by question type: grouping the labels in the knowledge base by question type, wherein each question type corresponds to one label set, and importing material into the knowledge base system for label classification before generating the corpus;
generating questions: for each question type, generating a series of related questions from the information in the knowledge base.
9. A corpus construction system for the subway field, characterized by comprising:
a data collection module, used for collecting voice data in the subway field;
a data preprocessing module, used for sequentially performing cleaning, labeling, and cutting preprocessing on the voice data in the subway field;
a data set construction module, used for performing text conversion on the preprocessed subway-field voice data, and adding the converted text data to the in-house voice library built on AISHELL to form an AISHELL data set;
a model training module, used for training a speech recognition model with the data in the AISHELL data set, and recognizing the integrated audio files of the AISHELL data set with the trained speech recognition model to generate text data;
a model building module, used for building a business keyword matching model and a label classification model for the subway field;
a knowledge base construction module, used for structuring the text data generated by the speech recognition model and classifying the structured data according to the business keyword matching model and the label classification model to obtain a corpus for the subway field.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when loaded by a processor, is capable of executing the steps of the corpus construction method for the subway field according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410043265.3A CN117875304A (en) | 2024-01-11 | 2024-01-11 | Corpus construction method, system and storage medium for subway field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117875304A (en) | 2024-04-12 |
Family
ID=90578829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410043265.3A Pending CN117875304A (en) | 2024-01-11 | 2024-01-11 | Corpus construction method, system and storage medium for subway field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117875304A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||