WO2020155750A1 - Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium - Google Patents

Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium

Info

Publication number
WO2020155750A1
Authority
WO
WIPO (PCT)
Prior art keywords
subtitle
audio
video
state parameter
segmented audio
Application number
PCT/CN2019/117261
Other languages
French (fr)
Chinese (zh)
Inventor
杨雨晨 (Yang Yuchen)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020155750A1

Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
              • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N21/433: Content storage operation, e.g. storage operation in response to a pause request, caching operations
                • H04N21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
                • H04N21/439: Processing of audio elementary streams
                  • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
                • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                  • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
              • H04N21/47: End-user applications
                • H04N21/488: Data services, e.g. news ticker
                  • H04N21/4884: Data services for displaying subtitles
            • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
              • H04N21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
                • H04N21/845: Structuring of content, e.g. decomposing content into time segments
                  • H04N21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Definitions

  • This application belongs to the field of natural language processing technology, and relates to artificial intelligence-based corpus collection methods, devices, equipment, and storage media.
  • Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in ways similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems.
  • Existing ways of collecting corpus for a given scene mainly include: (1) obtaining the corpus through free resource searches, which yields very limited corpus that rarely meets the need; (2) having a team record and annotate the corpus itself, which is inefficient and extremely labor-intensive; (3) purchasing the corpus through commercial channels, which is costly.
  • The embodiments of the present application disclose an artificial intelligence-based corpus collection method, device, equipment, and storage medium that can quickly collect corpus matching a given scenario.
  • Some embodiments of the present application disclose an artificial intelligence-based corpus collection method, including: obtaining configuration item information input by a user, the configuration item information including a target video keyword and a video website, the video website being the URL of a video website or the name of a video website; downloading from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file; separating an audio file from the video file, and splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks; segmenting the audio file according to the segment time of each subtitle block to obtain segmented audio; establishing associations between the segmented audio and the subtitle blocks; and classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords, then storing them together as the target corpus.
  • Some embodiments of the present application also disclose an artificial intelligence-based corpus collection device, including: a configuration item information acquisition module for acquiring configuration item information input by a user, the configuration item information including a target video keyword and a video website, the video website being the URL of a video website or the name of a video website; a video data download module for downloading from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file; an audio subtitle processing module for separating an audio file from the video file and splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks; an audio segmentation module for segmenting the audio file according to the segment time of each subtitle block to obtain segmented audio; an audio subtitle block association module for establishing associations between the segmented audio and the subtitle blocks; and a filtering module for classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords and storing them together as the target corpus.
  • Some embodiments of the present application also disclose a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the above artificial intelligence-based corpus collection method.
  • Some embodiments of the present application also disclose a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above artificial intelligence-based corpus collection method.
  • FIG. 1 is a flowchart of an artificial intelligence-based corpus collection method provided by an embodiment of the application;
  • FIG. 2 is a flowchart of a second specific implementation manner of step S106 in FIG. 1;
  • FIG. 3 is a flowchart of a third specific implementation manner of step S106 in FIG. 1;
  • FIG. 4 is a schematic flowchart of a specific implementation of step S405 in FIG. 3;
  • FIG. 5 is a schematic diagram of an artificial intelligence-based corpus collection device provided by an embodiment of the application;
  • FIG. 6 is a schematic diagram of the audio subtitle processing module in FIG. 5;
  • FIG. 7 is a schematic structural diagram of a second embodiment of the screening module in FIG. 5;
  • FIG. 8 is a schematic structural diagram of a third embodiment of the screening module in FIG. 5;
  • FIG. 9 is a schematic structural diagram of the voice state parameter score calculation module in FIG. 8;
  • FIG. 10 is a block diagram of the basic structure of a computer device 100 in an embodiment of the present application.
  • An embodiment of the application provides an artificial intelligence-based corpus collection method.
  • FIG. 1 is a schematic diagram of the artificial intelligence-based corpus collection provided by an embodiment of this application; as shown in FIG. 1, the method includes:
  • S101. Obtain configuration item information input by a user, where the configuration item information includes a target video keyword and a video website, and the video website is the URL of a video website or the name of a video website.
  • The target video keyword includes a keyword indicating a video name or video type; the video website may be the name of a video website, such as iQiyi or Youku, or the URL of a video website, such as the iQiyi or Youku homepage address.
  • S102. Download from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file.
  • Specifically, the video data in the embodiments of this application includes a subtitle file and a video file carrying audio and video signals. The video data can be, for example, movies, TV shows, variety shows, news, animations, or songs, or it can involve specific content such as consumer rights protection, complaints, meal-ordering dialogues, or specific cartoon content.
  • A web crawler (also known as a web spider or web robot) is a set of computer-readable instructions or a script that automatically crawls information on the World Wide Web according to certain rules. Specifically, implementations of downloading the video data may include the following.
  • In the first way, the web crawler opens the web page at the video URL entered by the user and automatically downloads the target video from it. For example, to download the movie The Pursuit of Happyness (《当幸福来敲门》), the user can preset a video URL containing that movie; the web crawler navigates to that URL, opens the web page containing the target video, and automatically downloads it.
  • In the second way, the video website input by the user is obtained; it can be the name of a video website, such as iQiyi or Youku, or the URL of a video website. When the input is a website name, the web crawler enters the name into a preset search engine such as Baidu to retrieve the website's URL, opens the video website, enters the target video keyword into the site's search box to search for the target video, and then, following the search results, opens each result page in turn and downloads all the videos. When the input is a website URL, the web crawler opens the corresponding video website directly, enters the target video keyword into the search box, and likewise opens each result page in turn and downloads all the videos.
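  • As a rough illustration of the second approach, the sketch below queries a video site's search page for a keyword and collects candidate result links. This is not the patent's implementation: the search endpoint, query parameter, and CSS selector are hypothetical placeholders, and a real site such as iQiyi or Youku would require its own URL format and typically an API or browser automation.

```python
# Minimal crawler sketch: search a video site and collect result links.
# The "/search" endpoint, "q" parameter, and "a.video-link" selector are
# assumptions for illustration, not any real site's interface.
import requests
from bs4 import BeautifulSoup

def search_video_pages(site_url, keyword):
    resp = requests.get(f"{site_url}/search", params={"q": keyword}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each search result is assumed to be an <a class="video-link" href=...>.
    return [a["href"] for a in soup.select("a.video-link") if a.get("href")]

if __name__ == "__main__":
    for url in search_video_pages("https://example-video-site.com", "animation children"):
        print(url)  # each result page would then be opened to download its video
```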
  • The target video keyword may be, for example, the name of a cartoon such as Boonie Bears (《熊出没》), or a keyword describing video content such as "cooking".
  • Setting the target video keyword: in practice there are automatic customer-service complaint handling platforms. When certain movies are known to carry indignant emotion, the keyword can be preset to the movie's name; keywords can also be set to filter the type of video resource to download. For example, a certain type of program (such as mediation programs or after-sales rights-protection programs) is known to contain much complaining, anger, and dissatisfaction, so the name of such a program, for example "消费主张" (Consumer Proposition), can be set as the target video keyword. In other scenes the atmosphere is cheerful and lively; early childhood education, for example, also involves a great deal of speech recognition and related technology, and a keyword such as "animation + children" can be set, since animated programs are mostly watched by children and are well suited to early education. Further, to restrict the search to video resources, "video" can be appended to the keyword, for example "consumer proposition + video" or "animation + children + video".
  • S103. Separate the audio file from the video file, and split the subtitle text content parsed from the SRT subtitle file into subtitle blocks. Step S103 includes two sub-steps. First, the audio file is separated from the video file; specifically, the audio in the video file is extracted through video/audio separation technology to obtain a standalone audio file. Second, the subtitle text content parsed from the SRT subtitle file is split into subtitle blocks. The two sub-steps are parallel and have no required order.
  • Specifically, in this embodiment, parsing the SRT subtitle file can yield subtitle text content such as the following (translated from the original example):
    1
    00:00:00,162 --> 00:00:01,875
    From now on
    2
    00:00:02,800 --> 00:00:03,000
    I only love you, spoil you, and will never lie to you
    3
    00:00:06,560 --> 00:00:11,520
    No one can hit you, scold you, or bully you; if anyone bullies you, I will come to help you right away
  • Here "1", "2", and "3" are the serial numbers of the subtitles: "1" denotes the first subtitle appearing in the audio signal, "2" the second, and "3" the third. The audio signal mainly consists of parts with subtitles and blank parts without subtitles. Each subtitle corresponds to two times: the first time (to the left of "-->") is the start time of the subtitle in the audio signal, and the second time (to the right of "-->") is its end time; the span from start time to end time is the subtitle's playback time. For example, "00:00:00,162" is the start time of the first subtitle in the audio signal and "00:00:01,875" is its end time, so "00:00:00,162 --> 00:00:01,875" is the playback time of the first subtitle's content, "From now on". "From now on" is the content of the first subtitle, "I only love you, spoil you, and will never lie to you" is the content of the second, and "No one can hit you, scold you, or bully you; if anyone bullies you, I will come to help you right away" is the content of the third.
  • The subtitle text content is split into blocks by combining the playback times with the sentence breaks, yielding the subtitle blocks; for example, "From now on" is split into one subtitle block, "I only love you, spoil you, and will never lie to you" into another, and "No one can hit you, scold you, or bully you; if anyone bullies you, I will come to help you right away" into a third.
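  • A compact sketch of this parsing step follows, assuming UTF-8 SRT input and treating each numbered SRT entry as one subtitle block keyed by its start and end times (in milliseconds, so they can index the audio later):

```python
# Sketch: parse an SRT file into subtitle blocks of (start_ms, end_ms, text).
import re

TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def parse_srt(path):
    blocks = []
    with open(path, encoding="utf-8") as f:
        # SRT entries are blank-line separated: index, time line, text line(s).
        for entry in f.read().strip().split("\n\n"):
            lines = entry.strip().splitlines()
            if len(lines) < 3:
                continue
            match = TIME_RE.match(lines[1])
            if not match:
                continue
            start = to_ms(*match.groups()[:4])
            end = to_ms(*match.groups()[4:])
            blocks.append((start, end, " ".join(lines[2:])))
    return blocks
```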
  • S104. Split the audio file according to the segment time of each subtitle block to obtain segmented audio. As described above, each subtitle corresponds to two times: the first is the start time of the subtitle in the audio signal, the second is its end time, and the span between them is the subtitle's playback time. Since the subtitle blocks are split according to the subtitles' playback times, the start and end time of each subtitle block can be read from the playback time, and the audio file is then cut at each block's start and end times, for example into the segments "00:00:00,162 --> 00:00:01,875", "00:00:02,800 --> 00:00:03,000", "00:00:06,560 --> 00:00:11,520", and so on. The resulting segmented audio corresponds one-to-one with the subtitle blocks.
  • S105. Establish associations between the segmented audio and the subtitle blocks. For example, the segmented audio for the period "00:00:00,162 --> 00:00:01,875" is associated with the subtitle block "From now on". The associated segmented audio and subtitle blocks can be stored at a designated folder address or stored separately, but the file names of the two must match.
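  • The slicing and association steps could look like the following sketch. It uses the pydub library as one possible tool (the patent names none) and encodes the association by giving each audio segment and its subtitle text the same file name:

```python
# Sketch: cut the separated audio at each subtitle block's start/end times
# and store audio and text under matching file names (the association).
import os
from pydub import AudioSegment  # pip install pydub; requires ffmpeg

def segment_and_associate(audio_path, blocks, out_dir):
    """blocks: list of (start_ms, end_ms, text), e.g. from parse_srt above."""
    audio = AudioSegment.from_file(audio_path)
    os.makedirs(out_dir, exist_ok=True)
    for i, (start_ms, end_ms, text) in enumerate(blocks):
        stem = os.path.join(out_dir, f"seg_{i:04d}")
        audio[start_ms:end_ms].export(stem + ".wav", format="wav")  # segmented audio
        with open(stem + ".txt", "w", encoding="utf-8") as f:
            f.write(text)                                           # subtitle block
```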
  • S106. The associated segmented audio and subtitle blocks are classified and filtered according to the preset filtering keywords and then stored together as the target corpus.
  • In the embodiments of the present application, the configuration item information input by the user is obtained and the video data of the target video is downloaded from the video website; the video data is then processed: the audio file is separated from the video file, the subtitle text content parsed from the SRT subtitle file is split into subtitle blocks, the audio is segmented according to the segment time of each subtitle block, the segmented audio is associated with the subtitle blocks, and the associated segmented audio and subtitle blocks are classified and filtered according to the preset filtering keywords and stored together as the target corpus. This achieves the purpose of quickly and automatically collecting the required corpus matching a certain type of scene, with high efficiency and low cost.
  • Specifically, the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus includes: analyzing whether each subtitle block contains text matching the preset filtering keywords, and storing each subtitle block containing matching text, together with the segmented audio associated with it, in a designated first location. The preset keyword classification helps filter out the required corpus: people may use more insulting words when angry and more positive words when happy. Therefore, to collect corpus of angry emotion, the preset filtering keywords can be phrases such as "that's too much" or "I am angry", or curse words such as "idiot"; to collect corpus of positive emotion, the preset filtering keywords can be words such as "make progress every day", "strive", or "keep it up". The stored associated segmented audio and subtitle blocks are the target corpus.
  • FIG. 2 is a flowchart of a second specific implementation manner of step S106 in FIG. 1. In this implementation, the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus includes: S301. Analyze whether each subtitle block contains text matching the preset filtering keywords. S302. Store each subtitle block containing matching text, together with the segmented audio associated with it, in the designated first location.
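  • A sketch of this matching step, assuming segment/subtitle pairs stored under matching file names as above: any pair whose text contains a preset filtering keyword is copied to the designated first location.

```python
# Sketch: keep only segment/subtitle pairs whose subtitle text contains one
# of the preset filtering keywords, copying them to the "first location".
import os
import shutil

def filter_by_keywords(pair_dir, first_location, keywords):
    os.makedirs(first_location, exist_ok=True)
    for name in os.listdir(pair_dir):
        if not name.endswith(".txt"):
            continue
        txt_path = os.path.join(pair_dir, name)
        with open(txt_path, encoding="utf-8") as f:
            text = f.read()
        if any(kw in text for kw in keywords):
            shutil.copy(txt_path, first_location)
            wav_path = txt_path[:-4] + ".wav"
            if os.path.exists(wav_path):
                shutil.copy(wav_path, first_location)

# e.g. filter_by_keywords("segments", "corpus/anger", ["I am angry", "too much"])
```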
  • In a further implementation, the screening configuration item information includes not only the filtering keywords used for screening but also voice state parameters used to assist in analyzing the emotion category of the segmented audio; the voice state parameters may include volume, frequency, amplitude, speech rate, and intonation.
  • Before the step of selecting the segmented audio whose voice state parameters all fall within preset standard intervals and storing the subtitle blocks associated with that segmented audio in a designated second location, the method further includes presetting the standard interval of each voice state parameter. This presetting step includes: obtaining corpus samples labeled with the target emotion category, statistically analyzing them to obtain, for each voice state parameter, the range within which the probability of samples under the target emotion category is greater than a preset value, and extracting an interval contained in that range as the preset standard interval. The corpus samples can be manually collected samples judged to match the desired emotion category, or existing samples collected by other means. The interval contained in the range can be the range itself or any interval within it; for example, if the range is 50 to 70, the interval can be 50 to 70, 50 to 60, 55 to 65, 60 to 70, and so on.
  • Taking the frequency parameter as an example: find a corpus sample library labeled with the target emotion category (such as anger), measure the frequency value of each corpus sample, and plot the normal probability distribution of frequency. If the probability that a sample's frequency falls in the range 50-70 Hz across all corpus samples is greater than the preset value (for example, 97%), then 50-70 Hz is the range of the frequency parameter in which the probability under the target emotion category exceeds the preset value. The same method yields such a range for each of the other voice state parameters, namely volume, amplitude, speech rate, and intonation. The range itself can be used as the preset standard interval, or a sub-interval of 50-70 Hz, such as 50-60 Hz, 55-65 Hz, or 60-70 Hz, can be selected. Segmented audio whose voice state parameters are all within the preset standard intervals means segmented audio whose five voice state parameters each lie within their respective preset standard intervals.
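  • One way to derive such an interval, sketched below under the assumption that the parameter values of the labeled samples have already been measured: take the central interval covering the preset probability mass (the patent only requires some interval whose sample probability exceeds the preset value).

```python
# Sketch: estimate the preset standard interval of one voice state parameter
# (e.g. frequency) from corpus samples labeled with the target emotion.
import numpy as np

def standard_interval(values, prob=0.97):
    tail = (1.0 - prob) / 2.0 * 100.0          # percent mass left in each tail
    lo, hi = np.percentile(values, [tail, 100.0 - tail])
    return lo, hi

# Stand-in data: frequencies of anger-labeled samples clustered around 60 Hz.
freqs = np.random.normal(60.0, 4.0, size=1000)
print(standard_interval(freqs))                # roughly (51, 69), i.e. ~50-70 Hz
```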
  • FIG. 3 is a flowchart of a third specific implementation manner of step S106 in FIG. 1. In this implementation, after step S302 of storing the subtitle block containing the matching text together with its associated segmented audio in the designated first location, the method further includes: S405. Calculate the score of each voice state parameter of each segmented audio stored in the first location. S406. Sum the scores of all voice state parameters of the same segmented audio and confirm whether the total score reaches a preset threshold; the preset threshold can be set by experience or demand, for example 80 points or 90 points. S407. Store the segmented audio whose total score reaches the preset threshold, together with the subtitle block associated with it, in a designated third location.
  • As above, the screening configuration item information in this embodiment includes not only the filtering keywords used for screening but also the voice state parameters used to assist in analyzing the emotion of the segmented audio; the voice state parameters include volume, frequency, amplitude, speech rate, and intonation.
  • FIG. 4 is a schematic flowchart of a specific implementation of step S405 in FIG. 3. More specifically, the step of calculating the score of each voice state parameter of each segmented audio stored in the first location includes the following. The corpus samples can be manually collected samples judged to match the desired emotion category, or corpus samples collected by other means. Taking the frequency parameter as an example: find a corpus sample library labeled with the target emotion category (such as anger), measure the frequency value of each corpus sample, and plot the normal probability distribution of frequency; if the probability of samples whose frequency lies in the range 50-70 Hz among all corpus samples is greater than the preset value (for example, 97%), then 50-70 Hz is the range of the frequency parameter in which the probability under the target emotion category exceeds the preset value. The same method yields such a range for each of the other voice state parameters.
  • A value within the range is selected as the preset standard value of the voice state parameter: the frequency standard value is denoted W_frequency, and W_volume, W_amplitude, W_speech_rate, and W_intonation denote the preset standard values of the other voice state parameters. The score of each voice state parameter is then calculated as M_i = 100 * S_i * (X_i / W_i), where M_i is the score of voice state parameter i, S_i is its weight, X_i is its measured value, W_i is its preset standard value, and i denotes the voice state parameter, specifically volume, frequency, amplitude, speech rate, or intonation.
  • The measured frequency value is denoted X_frequency, and X_volume, X_amplitude, X_speech_rate, and X_intonation denote the measured values of the other voice state parameters. The similarity of each voice state parameter to its standard value is denoted P_frequency for frequency, and P_volume, P_amplitude, P_speech_rate, and P_intonation for the others, computed as:
  • P_volume = X_volume / W_volume
  • P_frequency = X_frequency / W_frequency
  • P_amplitude = X_amplitude / W_amplitude
  • P_speech_rate = X_speech_rate / W_speech_rate
  • P_intonation = X_intonation / W_intonation
  • The weight value of each voice state parameter is denoted S_i: S_volume, S_frequency, S_amplitude, S_speech_rate, and S_intonation. A weight is set for each voice state parameter in advance; for example, when a person is angry the voice is noticeably much louder, so the weight of volume is relatively large and can be set to 60%.
  • M_volume = 100 * S_volume * (X_volume / W_volume)
  • M_frequency = 100 * S_frequency * (X_frequency / W_frequency)
  • M_amplitude = 100 * S_amplitude * (X_amplitude / W_amplitude)
  • M_speech_rate = 100 * S_speech_rate * (X_speech_rate / W_speech_rate)
  • M_intonation = 100 * S_intonation * (X_intonation / W_intonation)
  • Here M_volume, M_frequency, M_amplitude, M_speech_rate, and M_intonation are the scores of the respective voice state parameters, and S_volume, S_frequency, S_amplitude, S_speech_rate, and S_intonation are their weights. The total score of the same segmented audio is M = M_volume + M_frequency + M_amplitude + M_speech_rate + M_intonation.
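  • Putting the formulas together, the sketch below is a direct transcription of the scoring rule: each parameter's score is M_i = 100 * S_i * (X_i / W_i), and the segment's total score M is the sum over the five parameters. The weights, standard values, and threshold here are illustrative placeholders, not values from the patent.

```python
# Sketch: score one audio segment from its measured voice state parameters.
WEIGHTS = {    # S_i: illustrative weights, volume-heavy as in the anger example
    "volume": 0.60, "frequency": 0.10, "amplitude": 0.10,
    "speech_rate": 0.10, "intonation": 0.10,
}
STANDARDS = {  # W_i: illustrative preset standard values (e.g. 60 Hz frequency)
    "volume": 70.0, "frequency": 60.0, "amplitude": 0.8,
    "speech_rate": 5.0, "intonation": 1.0,
}

def total_score(measured):
    """measured: X_i values keyed like WEIGHTS; returns the total score M."""
    return sum(100.0 * WEIGHTS[i] * (measured[i] / STANDARDS[i]) for i in WEIGHTS)

segment = {"volume": 75.0, "frequency": 63.0, "amplitude": 0.9,
           "speech_rate": 5.5, "intonation": 1.1}
if total_score(segment) >= 90.0:   # preset threshold, e.g. 90 points
    print("store this segment in the third location")
```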
  • The advantage brought by the embodiments of the present application is that the required corpus matching a certain type of scene is collected automatically and quickly, with high efficiency and low cost. By setting multiple voice state parameters, statistically analyzing corpus samples to obtain a range, selecting an interval within the range as the preset standard interval or a value within it as the preset standard value, then measuring the segmented audio and calculating its score, the selected target corpus conforms more closely to the intended emotion.
  • FIG. 5 is a schematic structural diagram of a first embodiment of the artificial intelligence-based corpus collection device described in this application;
  • The artificial intelligence-based corpus collection device includes: a configuration item information acquisition module 51, a video data download module 52, an audio subtitle processing module 53, an audio segmentation module 54, an audio subtitle block association module 55, and a screening module 56.
  • The configuration item information acquisition module 51 is configured to obtain the configuration item information input by the user, where the configuration item information includes a target video keyword and a video website, and the video website is the URL of a video website or the name of a video website.
  • The video data download module 52 is used to download from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file.
  • The audio subtitle processing module 53 is used to separate the audio file from the video file and split the subtitle text content parsed from the SRT subtitle file into subtitle blocks.
  • The audio segmentation module 54 is used to segment the audio file according to the segment time of each subtitle block to obtain segmented audio.
  • The audio subtitle block association module 55 is used to establish associations between the segmented audio and the subtitle blocks. The filtering module 56 is used to classify and filter the associated segmented audio and subtitle blocks according to the preset filtering keywords and store them together as the target corpus.
  • FIG. 6 is a schematic structural diagram of the audio subtitle processing module in FIG. 5. Specifically, in an embodiment of the present application, the audio subtitle processing module 53 includes: a subtitle splitting module 531, configured to parse the SRT subtitle file to obtain the subtitle text content and to split that content into subtitle blocks by combining the playback times and sentence breaks; and an audio/video separation module 532, used to separate the audio file from the video file.
  • In one embodiment, the screening module 56 includes: a keyword matching module 561 for analyzing whether each subtitle block contains text matching the preset filtering keywords; and a first storage module 562 for storing each subtitle block containing matching text, together with the segmented audio associated with that subtitle block, in the designated first location.
  • FIG. 7 is a schematic structural diagram of the second embodiment of the screening module in FIG. 5. In this embodiment the screening module 56 includes, in addition to the keyword matching module 561 and the first storage module 562: a voice state parameter judgment module 563 for judging whether each voice state parameter of the segmented audio stored in the first location is within its preset standard interval, where the voice state parameters are included in the preset screening configuration item information to assist in analyzing the emotion of the segmented audio; and a second storage module 564 for selecting the segmented audio whose voice state parameters are all within the preset standard intervals and storing the subtitle blocks associated with that segmented audio in the designated second location.
  • FIG. 8 is a schematic structural diagram of the third embodiment of the screening module in FIG. 5. Specifically, in other embodiments, the screening module 56 includes, in addition to the keyword matching module 561 and the first storage module 562: a voice state parameter score calculation module 565, configured to calculate the score of each voice state parameter of each segmented audio stored in the first location; a total score calculation and judgment module 566, configured to sum the scores of all voice state parameters of the same segmented audio and confirm whether the total score reaches the preset threshold; and a third storage module 567, used to store the segmented audio whose total score reaches the preset threshold, together with the subtitle block associated with that segmented audio, in the designated third location.
  • FIG. 9 is a schematic structural diagram of the voice state parameter score calculation module in FIG. 8. Specifically, the voice state parameter score calculation module 565 includes: a range analysis module 5651 for obtaining corpus samples labeled with the target emotion category and analyzing them statistically to obtain, for each voice state parameter, the range in which the probability under the target emotion category is greater than the preset value; the corpus samples can be manually collected samples judged to match the desired target emotion category, or samples collected by other means.
  • Taking the frequency parameter as an example: find a corpus sample library labeled with the target emotion category (such as anger), measure the frequency value of each corpus sample, and plot the normal probability distribution of frequency; if the probability of samples whose frequency lies in the range 50-70 Hz among all corpus samples is greater than the preset value (for example, 97%), then 50-70 Hz is the range of the frequency parameter in which the probability exceeds the preset value. The same method yields such a range for each voice state parameter under the target emotion category.
  • The standard value setting module 5652 is used to select a value within the range as the preset standard value of the voice state parameter; the frequency standard value is denoted W_frequency, and W_volume, W_amplitude, W_speech_rate, and W_intonation denote the preset standard values of the other voice state parameters. The preset standard value can be the median or any value within the range; for example, in this embodiment the median frequency of 60 Hz is selected as the preset standard value of frequency.
  • The test value module 5653 is used to measure each voice state parameter value of each segmented audio stored in the first location.
  • The score calculation module 5654 is used to calculate the score of each voice state parameter from the preset standard value of the voice state parameter, the measured parameter value, and the received weight value according to the following formula:
  • M_i = 100 * S_i * (X_i / W_i), where M_i is the score of voice state parameter i, S_i is its weight, X_i is its measured value, W_i is its preset standard value, and i denotes the voice state parameter, specifically volume, frequency, amplitude, speech rate, or intonation.
  • FIG. 10 is a block diagram of the basic structure of the computer device 100 in an embodiment of the application.
  • the computer device 100 includes a memory 101, a processor 102, and a network interface 103 that are communicatively connected to each other through a system bus.
  • FIG. 10 only shows the computer device 100 with components 101-103, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 101 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the memory 101 may be an internal storage unit of the computer device 100, such as a hard disk or memory of the computer device 100.
  • The memory 101 may also be an external storage device of the computer device 100, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 100.
  • the memory 101 may also include both an internal storage unit of the computer device 100 and an external storage device thereof.
  • the memory 101 is generally used to store an operating system and various application software installed in the computer device 100, such as the aforementioned artificial intelligence-based corpus collection method.
  • the memory 101 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 102 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 102 is generally used to control the overall operation of the computer device 100.
  • the processor 102 is configured to run computer-readable instructions or process data stored in the memory 101, for example, run the computer-readable instructions of the aforementioned artificial intelligence-based corpus collection method.
  • the network interface 103 may include a wireless network interface or a wired network interface, and the network interface 103 is generally used to establish a communication connection between the computer device 100 and other electronic devices.
  • This application also provides another implementation manner, that is, a non-volatile readable storage medium is provided, which stores computer-readable instructions executable by at least one processor, so that the at least one processor executes the steps of any of the foregoing artificial intelligence-based corpus collection methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An artificial intelligence-based corpus collecting method, apparatus, and device, and a storage medium, related to the technical field of natural language processing. The method comprises: acquiring configuration item information inputted by a user (S101); downloading from a video website video data of a target video obtained by searching for a target video keyword, the video data comprising a video file and an SRT subtitle file (S102); separating an audio file from the video file and splitting subtitle text content parsed from the SRT subtitle file into subtitle blocks (S103); segmenting the audio file on the basis of the segment time of each subtitle block to acquire segmented audio (S104); establishing associations between the segmented audio and the subtitle blocks (S105); and sorting and screening the associated segmented audio and subtitle blocks according to preset screening keywords, then jointly storing them as a target corpus (S106). The method achieves the goal of automatically and quickly collecting a corpus satisfying the requirements of a certain type of scenario, and is highly efficient and inexpensive.

Description

Artificial intelligence-based corpus collection method, device, equipment and storage medium
This application is based on, and claims priority from, Chinese invention patent application No. 201910081793.7, filed on January 28, 2019 and titled "Artificial Intelligence-Based Corpus Collection Method, Device, Equipment, and Storage Medium".
Technical Field
This application belongs to the field of natural language processing technology, and relates to artificial intelligence-based corpus collection methods, devices, equipment, and storage media.
Background
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in ways similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems.
In reality, AI-based natural language processing often requires corpus matching various scenarios to be collected in advance. For example, an automatic customer-service complaint handling platform needs corpus expressing "complaints", "dissatisfaction", and time urgency, so that the priority of work-order intake and the assigned handler can be adjusted flexibly according to urgency and severity, helping complaints to be handled and resolved quickly. As another example, early childhood education and children's fun-dialogue applications need corpus based on children's voices with a cheerful, lively mood.
Existing ways of collecting corpus for a given scene mainly include: (1) obtaining the corpus through free resource searches, which yields very limited corpus that rarely meets the need; (2) having a team record and annotate the corpus itself, which is inefficient and extremely labor-intensive; (3) purchasing the corpus through commercial channels, which is costly.
Therefore, existing corpus collection methods are inefficient and costly; how to quickly collect corpus meeting the needs of a certain type of scene has become an urgent problem to solve.
Summary of the Invention
The embodiments of the present application disclose an artificial intelligence-based corpus collection method, device, equipment, and storage medium that can quickly collect corpus matching a given scenario.
Some embodiments of the present application disclose an artificial intelligence-based corpus collection method, including: obtaining configuration item information input by a user, the configuration item information including a target video keyword and a video website, the video website being the URL of a video website or the name of a video website; downloading from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file; separating an audio file from the video file, and splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks; segmenting the audio file according to the segment time of each subtitle block to obtain segmented audio; establishing associations between the segmented audio and the subtitle blocks; and classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords, then storing them together as the target corpus.
Some embodiments of the present application also disclose an artificial intelligence-based corpus collection device, including: a configuration item information acquisition module for acquiring configuration item information input by a user, the configuration item information including a target video keyword and a video website, the video website being the URL of a video website or the name of a video website; a video data download module for downloading from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file; an audio subtitle processing module for separating an audio file from the video file and splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks; an audio segmentation module for segmenting the audio file according to the segment time of each subtitle block to obtain segmented audio; an audio subtitle block association module for establishing associations between the segmented audio and the subtitle blocks; and a filtering module for classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords and storing them together as the target corpus.
Some embodiments of the present application also disclose a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the above artificial intelligence-based corpus collection method.
Some embodiments of the present application also disclose a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above artificial intelligence-based corpus collection method.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the specification, drawings, and claims.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a flowchart of an artificial intelligence-based corpus collection method provided by an embodiment of the application;
FIG. 2 is a flowchart of a second specific implementation manner of step S106 in FIG. 1;
FIG. 3 is a flowchart of a third specific implementation manner of step S106 in FIG. 1;
FIG. 4 is a schematic flowchart of a specific implementation of step S405 in FIG. 3;
FIG. 5 is a schematic diagram of an artificial intelligence-based corpus collection device provided by an embodiment of the application;
FIG. 6 is a schematic diagram of the audio subtitle processing module in FIG. 5;
FIG. 7 is a schematic structural diagram of a second embodiment of the screening module in FIG. 5;
FIG. 8 is a schematic structural diagram of a third embodiment of the screening module in FIG. 5;
FIG. 9 is a schematic structural diagram of the voice state parameter score calculation module in FIG. 8;
FIG. 10 is a block diagram of the basic structure of a computer device 100 in an embodiment of the present application.
Detailed Description
To facilitate understanding, the present application is described more fully below with reference to the relevant drawings, in which preferred embodiments of the application are shown. However, this application may be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of this application will be understood thoroughly and comprehensively.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the application.
An embodiment of the application provides an artificial intelligence-based corpus collection method. Referring to FIG. 1, a schematic diagram of the artificial intelligence-based corpus collection provided by an embodiment of this application; as shown in FIG. 1, the method includes:
S101. Obtain configuration item information input by a user, where the configuration item information includes a target video keyword and a video website, and the video website is the URL of a video website or the name of a video website.
The target video keyword includes a keyword indicating a video name or video type; the video website may be the name of a video website, such as iQiyi or Youku, or the URL of a video website, such as the iQiyi or Youku homepage address.
S102. Download from the video website the video data of the target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file.
Specifically, the video data in the embodiments of this application includes a subtitle file and a video file carrying audio and video signals. The video data can be, for example, movies, TV shows, variety shows, news, animations, or songs, or it can involve specific content such as consumer rights protection, complaints, meal-ordering dialogues, or specific cartoon content.
A web crawler (also known as a web spider or web robot) is a set of computer-readable instructions or a script that automatically crawls information on the World Wide Web according to certain rules. Specifically, implementations of downloading the video data may include the following.
In the first way, the web crawler opens the web page at the video URL entered by the user and automatically downloads the target video from it. For example, to download the movie The Pursuit of Happyness (《当幸福来敲门》), the user can preset a video URL containing that movie; the web crawler navigates to that URL, opens the web page containing the target video, and automatically downloads it.
In the second way, the video website input by the user is obtained; it can be the name of a video website, such as iQiyi or Youku, or the URL of a video website. When the input is a website name, the web crawler enters the name into a preset search engine such as Baidu to retrieve the website's URL, opens the video website, enters the target video keyword into the site's search box to search for the target video, and then, following the search results, opens each result page in turn and downloads all the videos. When the input is a website URL, the web crawler opens the corresponding video website directly, enters the target video keyword into the search box, and likewise opens each result page in turn and downloads all the videos. The target video keyword may be, for example, the name of a cartoon such as Boonie Bears (《熊出没》), or a keyword describing video content such as "cooking".
Setting the target video keyword: in practice there may be, for example, an automated customer-service complaint-handling platform. When certain movies are known to carry a predominantly indignant emotional tone, the keyword may be preset to the movie's name, for example "XXX". Keywords may also be set to filter the type of video resources to be downloaded: knowing that a certain class of programs (such as mediation programs or after-sales rights-protection programs) contains a great deal of complaint, anger, and dissatisfaction, the name of such a program, for example "消费主张" (a consumer-affairs program), may be set as the target video keyword. In other scenarios the atmosphere is cheerful and lively; early-childhood education, for instance, also involves many technologies such as speech recognition, and knowing that animated programs are mostly watched by young children and are well suited to early education, the keyword may be set to "animation + toddler".
Further, to indicate that resources in video format are required, the word "video" may be appended to the target video keyword, for example "消费主张 + video" or "animation + toddler + video", restricting the search to video resources.
The above are merely examples and are not intended to limit this application.
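By way of illustration only, the following Python sketch shows how a simple crawler of the kind described above might search a video site and download the videos on the result pages. The "/search" path, the "q" query parameter, and the CSS selectors are hypothetical placeholders, not part of this application; a real video website would require its own site-specific handling.

```python
import os
import requests
from bs4 import BeautifulSoup

def crawl_and_download(site_url, keyword, out_dir="videos"):
    """Search a (hypothetical) video site for `keyword` and download the results."""
    os.makedirs(out_dir, exist_ok=True)
    # Placeholder search endpoint and parameter; site-specific in practice.
    resp = requests.get(site_url + "/search", params={"q": keyword + " video"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.select("a.result"):              # one result page per hit
        page = requests.get(link["href"])
        page_soup = BeautifulSoup(page.text, "html.parser")
        for src in page_soup.select("video source"):  # every video on the page
            video = requests.get(src["src"], stream=True)
            name = src["src"].rsplit("/", 1)[-1]
            with open(os.path.join(out_dir, name), "wb") as f:
                for chunk in video.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
```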
S103. Separate the audio file from the video file, and split the subtitle text content parsed from the SRT subtitle file into subtitle blocks.
Specifically, step S103 includes two sub-steps. The first separates the audio file from the video file: the audio in the video file is extracted through audio-video separation technology to obtain a standalone audio file. The second splits the subtitle text content parsed from the SRT subtitle file into subtitle blocks.
The two sub-steps of step S103 are parallel and may be performed in either order.
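As a minimal sketch of the audio-video separation sub-step, assuming the ffmpeg command-line tool is installed, the audio track can be extracted into a standalone WAV file as follows; the file names are illustrative.

```python
import subprocess

def separate_audio(video_path: str, audio_path: str) -> None:
    # -vn drops the video stream; the audio is re-encoded as 16-bit PCM WAV
    # at 16 kHz, a convenient format for the later segmentation step.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ar", "16000", audio_path],
        check=True,
    )

separate_audio("target_video.mp4", "target_video.wav")
```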
Specifically, in this embodiment, parsing the SRT subtitle file yields subtitle text content such as the following:
1
00:00:00,162 --> 00:00:01,875
从现在开始 (From now on)

2
00:00:02,800 --> 00:00:03,000
我只疼你，宠你，不会骗你 (I will only love you, spoil you, and never lie to you)

3
00:00:06,560 --> 00:00:11,520
没有人能打你，骂你，欺负你，有人欺负你我会第一时间出来帮你 (No one may hit you, scold you, or bully you; if anyone bullies you, I will come to your aid at once)
Here, "1", "2", and "3" are the sequence numbers of the subtitles: "1" denotes the first subtitle appearing in the audio signal, "2" the second, and "3" the third. The audio signal consists mainly of subtitled portions and blank portions without subtitles. Each subtitle corresponds to two times: the first time (to the left of "-->") is the start time at which the subtitle appears in the audio signal, and the second time (to the right of "-->") is the time at which the subtitle ends; the span from the start time to the end time is the subtitle's display period. For example, "00:00:00,162" is the start time of the first subtitle in the audio signal, "00:00:01,875" is its end time, and "00:00:00,162 --> 00:00:01,875" is the display period of the first subtitle's content "从现在开始" (From now on). "从现在开始" is the content of the first subtitle, "我只疼你，宠你，不会骗你" is the content of the second, and "没有人能打你，骂你，欺负你，有人欺负你我会第一时间出来帮你" is the content of the third.
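The SRT format above is regular enough to parse with a few lines of Python. The following sketch, using only the standard library, converts the subtitle text content into (start, end, text) triples; it assumes the comma-separated millisecond timestamp form shown above.

```python
import re
from datetime import timedelta

CUE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n"
    r"(.+?)(?:\n\n|\Z)",
    re.S,
)

def to_ms(h, m, s, ms):
    # Convert an hours/minutes/seconds/milliseconds tuple to milliseconds.
    return int(timedelta(hours=int(h), minutes=int(m), seconds=int(s),
                         milliseconds=int(ms)).total_seconds() * 1000)

def parse_srt(srt_text):
    # Returns a list of (start_ms, end_ms, subtitle_text) triples,
    # one per subtitle cue, in display order.
    cues = []
    for m in CUE.finditer(srt_text):
        start = to_ms(*m.group(2, 3, 4, 5))
        end = to_ms(*m.group(6, 7, 8, 9))
        cues.append((start, end, m.group(10).strip()))
    return cues
```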
Specifically, in this embodiment, the subtitle text content is divided into blocks according to the display period and the punctuation marking sentence breaks, yielding subtitle blocks. For example, "从现在开始" is split into one subtitle block, "我只疼你，宠你，不会骗你" into another, and "没有人能打你，骂你，欺负你，有人欺负你我会第一时间出来帮你" into a third.

S104. Cut the audio file according to the segment time of each subtitle block to obtain segmented audio.
In the parsed subtitle text, each subtitle corresponds to two times: the first is the start time at which the subtitle appears in the audio signal, and the second is the time at which it ends; the span between them is the subtitle's display period. Since the subtitle blocks are split according to the subtitles' display periods, the start and end time of each subtitle block can be obtained from those periods, and the audio file is then cut at the start and end time of each subtitle block, for example into the segments "00:00:00,162 --> 00:00:01,875", "00:00:02,800 --> 00:00:03,000", "00:00:06,560 --> 00:00:11,520", and so on. The resulting audio segments correspond one-to-one with the subtitle blocks.
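A minimal sketch of step S104, assuming the pydub library (which itself relies on ffmpeg) is available: pydub's AudioSegment supports slicing by milliseconds, which maps directly onto the cue times produced by the parser sketched above.

```python
from pydub import AudioSegment

def split_audio(audio_path, cues):
    # cues: list of (start_ms, end_ms, text) triples from parse_srt().
    audio = AudioSegment.from_file(audio_path)
    segments = []
    for start_ms, end_ms, _text in cues:
        segments.append(audio[start_ms:end_ms])  # pydub slices in milliseconds
    return segments
```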
S105. Establish an association between the segmented audio and the subtitle blocks.
The segmented audio is associated with the subtitle blocks; for example, the audio segment spanning "00:00:00,162 --> 00:00:01,875" is associated with the subtitle block "从现在开始". The associated audio segments and subtitle blocks may be stored under one designated folder path or stored separately, but in either case the file names of the two must match.
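One simple way to realize the association of step S105, as a sketch: write each audio segment and its subtitle block to disk under a shared base name, so that the pairing can be recovered from the file names alone. The directory layout and naming scheme are assumptions for illustration.

```python
import os

def store_pairs(segments, cues, out_dir="corpus_raw"):
    os.makedirs(out_dir, exist_ok=True)
    for idx, (segment, (_start, _end, text)) in enumerate(zip(segments, cues)):
        base = os.path.join(out_dir, f"utt_{idx:05d}")
        segment.export(base + ".wav", format="wav")    # segmented audio
        with open(base + ".txt", "w", encoding="utf-8") as f:
            f.write(text)                              # associated subtitle block
```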
S106. Classify and filter the associated segmented audio and subtitle blocks according to preset filtering keywords, and store them together as the target corpus.
In the embodiments of this application, the configuration item information entered by the user is obtained and the video data of the target video is downloaded from the video website; the video data is then processed: the audio file is separated from the video file, the subtitle text content parsed from the SRT subtitle file is split into subtitle blocks, the audio is cut according to the segment time of each subtitle block, the audio segments are associated with the subtitle blocks, and the associated segments and blocks are classified and filtered according to the preset filtering keywords and stored together as the target corpus. This achieves rapid, automatic collection of the required corpus matching a given class of scenario, for example a corpus matching the preset filtering keywords, with high efficiency and low cost.
Specifically, in a first specific implementation of step S106, the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus includes: analyzing whether each subtitle block contains text matching a preset filtering keyword; and storing each subtitle block that contains matching text, together with the audio segment associated with that subtitle block, in a designated first location.
Specifically, in the embodiments of this application, classification and filtering by preset keywords helps select the required corpus. A person in an angry state tends to use more insulting vocabulary, while a happy person tends to use more positive vocabulary. Therefore, to collect a corpus of angry speech, the preset filtering keywords may be phrases such as "太过分" (that is going too far) or "我很生气" (I am very angry), or insults such as "贱人" or "傻子"; to collect a corpus of positive speech, the preset filtering keywords may be "天天向上" (improve every day), "奋斗" (strive), "加油" (keep it up), and so on.
The text of each subtitle block is extracted and compared against the preset filtering keywords to determine whether the subtitle block contains matching text; the keyword matching may be fuzzy matching. The associated audio segments and subtitle blocks that are stored constitute the target corpus.
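A sketch of this first implementation (steps S301/S302 below): each subtitle block's text is compared against the preset filtering keywords, here with simple substring matching standing in for the fuzzy matching mentioned above, and matching pairs are copied to the designated first location. The paths and keyword list are illustrative.

```python
import glob
import os
import shutil

ANGER_KEYWORDS = ["太过分", "我很生气"]  # illustrative preset filtering keywords

def filter_by_keywords(raw_dir="corpus_raw", first_location="corpus_matched",
                       keywords=ANGER_KEYWORDS):
    os.makedirs(first_location, exist_ok=True)
    for txt_path in glob.glob(os.path.join(raw_dir, "*.txt")):
        with open(txt_path, encoding="utf-8") as f:
            text = f.read()
        # Substring matching as a stand-in for fuzzy matching.
        if any(kw in text for kw in keywords):
            wav_path = txt_path[:-4] + ".wav"
            shutil.copy(txt_path, first_location)  # subtitle block
            shutil.copy(wav_path, first_location)  # associated segmented audio
```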
Referring to FIG. 2, FIG. 2 is a flowchart of a second specific implementation of step S106 in FIG. 1.
Specifically, in some embodiments, the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus includes: S301. Analyze whether each subtitle block contains text matching a preset filtering keyword. S302. Store each subtitle block that contains matching text, together with the audio segment associated with that subtitle block, in a designated first location.
As in the first implementation, classification and filtering by the preset keywords helps select the required corpus; the keyword matching may be fuzzy matching, and the associated audio segments and subtitle blocks that are stored constitute the target corpus.
S303. Determine whether every voice state parameter of each audio segment stored in the first location lies within a preset standard interval. S304. Select the audio segments whose voice state parameters all lie within the preset standard intervals and store them, together with the subtitle blocks associated with those segments, in a designated second location.
Specifically, in this embodiment, the filtering configuration item information includes, in addition to the filtering keywords used for selection, voice state parameters used to assist in analyzing the emotion category of the segmented audio. The voice state parameters may include volume, frequency, amplitude, speech rate, and intonation.
Before the step of selecting the audio segments whose voice state parameters all lie within the preset standard intervals and storing them, together with their associated subtitle blocks, in the designated second location, the method further includes: presetting a standard interval for each voice state parameter.
Specifically, the step of presetting the standard interval for each voice state parameter includes: obtaining corpus samples labeled with a target emotion category and performing statistical analysis on them to obtain, for each voice state parameter, the range of values within which the proportion of samples under the target emotion category exceeds a preset value; and extracting, from that range, an interval contained within the range as the preset standard interval. The corpus samples may be samples collected manually and judged to match the desired emotion category, or existing samples collected by other means. The interval contained within the range may be identical to the range or may be a sub-interval of it; for example, if the range is 50 to 70, the interval may be 50 to 70, or 50 to 60, 55 to 65, 60 to 70, and so on.
More specifically, in this embodiment, take the frequency parameter as an example. A library of corpus samples labeled with the target emotion category (for example, anger) is obtained, the frequency value of each sample is measured, and a normal probability distribution of the frequencies is plotted. If it is found that samples with frequencies in the range 50 to 70 Hz account for a proportion of all corpus samples greater than the preset value (for example, 97%), the range of the frequency parameter under the target emotion category is obtained; the same method yields, for each voice state parameter, the range within which its proportion under the target emotion category exceeds the preset value. That range may be taken directly as the preset standard interval, or a sub-interval of 50 to 70 Hz, such as 50 to 60 Hz, 55 to 65 Hz, or 60 to 70 Hz, may be chosen as the preset standard interval. The other voice state parameters, namely volume, amplitude, speech rate, and intonation, are treated in the same way.
An audio segment whose voice state parameters are all within the preset standard intervals is one in which each of the five voice state parameters lies within its corresponding preset standard interval.
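As a sketch of how such a standard interval might be derived from the labeled samples: assuming the measured values of a parameter over the sample library are roughly normally distributed, an interval covering the stated proportion (for example, 97%) can be taken symmetrically around the mean of a fitted normal distribution. numpy and scipy are assumed available; measuring the parameter values themselves is outside this sketch.

```python
import numpy as np
from scipy import stats

def standard_interval(values, coverage=0.97):
    # values: measured values of one voice state parameter (e.g. frequency in Hz)
    # over a corpus sample library labeled with the target emotion category.
    mu, sigma = np.mean(values), np.std(values)
    # Interval around the mean covering `coverage` of the fitted normal distribution.
    lo, hi = stats.norm.interval(coverage, loc=mu, scale=sigma)
    return lo, hi

def in_all_intervals(params, intervals):
    # params: {"volume": x, ...}; intervals: same keys mapped to (lo, hi) pairs.
    return all(intervals[k][0] <= v <= intervals[k][1] for k, v in params.items())
```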
Referring to FIG. 3, FIG. 3 is a flowchart of a third specific implementation of step S106 in FIG. 1.
After step S302 stores the subtitle blocks containing matching text, together with the audio segments associated with those blocks, in the designated first location, the method further includes: S405. Calculate a score for each voice state parameter of each audio segment stored in the first location. S406. Sum the scores of all voice state parameters of the same audio segment and determine whether the total score reaches a preset threshold. The preset threshold may be set empirically or according to requirements, for example 80 or 90 points. S407. Store the audio segments whose total score reaches the preset threshold, together with the subtitle blocks associated with those segments, in a designated third location.
Specifically, the filtering configuration item information in this embodiment includes, in addition to the filtering keywords used for selection, voice state parameters used to assist in analyzing the emotion of the segmented audio; the voice state parameters include volume, frequency, amplitude, speech rate, and intonation.
Next, referring to FIG. 4, FIG. 4 is a schematic flowchart of a specific implementation of step S405 in FIG. 3. More specifically, the step of calculating the score of each voice state parameter of each audio segment stored in the first location includes:
S501. Obtain corpus samples labeled with the target emotion category and perform statistical analysis to obtain, for each voice state parameter, the range of values within which the proportion of samples under the target emotion category exceeds a preset value.
The corpus samples may be samples collected manually and judged to match the desired emotion category, or existing samples collected by other means. In this embodiment, taking frequency as an example, a library of corpus samples labeled with the target emotion category (for example, anger) is obtained, the frequency value of each sample is measured, and a normal probability distribution of the frequencies is plotted. If samples with frequencies in the range 50 to 70 Hz account for a proportion of all corpus samples greater than the preset value (for example, 97%), the range of the frequency parameter under the target emotion category is obtained; the same method yields the corresponding range for each voice state parameter.
S502. Select a value within that range, for example the median, as the preset standard value of the voice state parameter.
Here the frequency standard value is denoted W_frequency; W_volume, W_amplitude, W_speech_rate, and W_intonation denote the preset standard values of the other voice state parameters.
S503. Measure each voice state parameter value of each audio segment stored in the first location.
S504. Based on the preset standard values of the voice state parameters, the measured voice state parameter values, and the received weight values, calculate the score of each voice state parameter by the following formula:
M_i = 100 * S_i * (X_i / W_i), where M_i is the score of each voice state parameter, S_i is the weight value of each voice state parameter, X_i is the measured voice state parameter value, W_i is the preset standard value of the voice state parameter, and the subscript i denotes the voice state parameter, which may specifically be volume, frequency, amplitude, speech rate, or intonation.
Specifically, the measured value X_i of each voice state parameter of each audio segment stored in the first location is compared with the preset standard value W_i; the resulting ratio is called the similarity P_i, that is, P_i = X_i / W_i.
For example, the measured frequency value of each audio segment stored in the first location is compared with the preset frequency standard value W_frequency to obtain the frequency similarity P_frequency. The measured frequency value is denoted X_frequency; X_volume, X_amplitude, X_speech_rate, and X_intonation denote the measured values of the other voice state parameters, and P_volume, P_amplitude, P_speech_rate, and P_intonation denote their similarities. The specific formulas are as follows:
P_volume = X_volume / W_volume, P_frequency = X_frequency / W_frequency, P_amplitude = X_amplitude / W_amplitude, P_speech_rate = X_speech_rate / W_speech_rate, P_intonation = X_intonation / W_intonation.
The preset weight value of each voice state parameter is then received. The weight value is denoted S_i, and the weights of the individual parameters are S_volume, S_frequency, S_amplitude, S_speech_rate, and S_intonation. A weight is assigned to each voice state parameter in advance; for example, a person's voice is noticeably much louder when angry, so the weight of volume is relatively large and may be set to 60%.
From M_i = 100 * S_i * P_i, the formula M_i = 100 * S_i * (X_i / W_i) follows, and the score of each voice state parameter is calculated with reference to it. Specifically, the following formulas apply:
M_volume = 100 * S_volume * (X_volume / W_volume), M_frequency = 100 * S_frequency * (X_frequency / W_frequency), M_amplitude = 100 * S_amplitude * (X_amplitude / W_amplitude), M_speech_rate = 100 * S_speech_rate * (X_speech_rate / W_speech_rate), M_intonation = 100 * S_intonation * (X_intonation / W_intonation),
where M_volume, M_frequency, M_amplitude, M_speech_rate, and M_intonation denote the scores of the individual voice state parameters, and S_volume, S_frequency, S_amplitude, S_speech_rate, and S_intonation denote their weight values.
The score of each voice state parameter can thus be calculated.
Returning to FIG. 3: S406. Sum the scores of all voice state parameters of the same audio segment and determine whether the total score reaches the preset threshold. Specifically, with M denoting the total score of an audio segment, the total score of the segment is obtained from the formula M = M_volume + M_frequency + M_amplitude + M_speech_rate + M_intonation.
The total score of each audio segment is compared with the preset threshold to determine whether it reaches the threshold.
S407. Store the audio segments whose total score reaches the preset threshold, together with the subtitle blocks associated with those segments, in the designated third location. Specifically, if the total score is greater than or equal to the preset threshold, the audio segment is stored, together with its associated subtitle block, in the designated third location.
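The scoring of steps S405 to S407 can be written down directly from the formulas above. A sketch, assuming the standard values W_i, weights S_i, and measured values X_i for a segment are already available as dictionaries keyed by parameter name; the numeric values below are illustrative only.

```python
def parameter_scores(X, W, S):
    # M_i = 100 * S_i * (X_i / W_i) for i in {volume, frequency, amplitude,
    # speech_rate, intonation}; X[i] / W[i] is the similarity P_i.
    return {i: 100 * S[i] * (X[i] / W[i]) for i in X}

def total_score(X, W, S):
    # M = M_volume + M_frequency + M_amplitude + M_speech_rate + M_intonation
    return sum(parameter_scores(X, W, S).values())

# Illustrative values: a segment is kept (stored to the third location)
# when its total score reaches the preset threshold, e.g. 80 points.
W = {"volume": 70, "frequency": 60, "amplitude": 0.5, "speech_rate": 5.0, "intonation": 1.0}
S = {"volume": 0.6, "frequency": 0.1, "amplitude": 0.1, "speech_rate": 0.1, "intonation": 0.1}
X = {"volume": 75, "frequency": 62, "amplitude": 0.45, "speech_rate": 5.2, "intonation": 0.9}
keep = total_score(X, W, S) >= 80
```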
The benefit of the embodiments of this application is that the required corpus matching a given class of scenario is collected automatically and rapidly, with high efficiency and low cost. Multiple voice state parameters are configured; statistical analysis of corpus samples labeled with an emotion category yields a range, from which an interval is chosen as the preset standard interval or a specific value as the preset standard value; the audio segments are then measured, scored, and selected, so that the emotion of the selected target corpus conforms more closely to the standard.
An embodiment of this application provides an artificial intelligence-based corpus collection apparatus. Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a first embodiment of the artificial intelligence-based corpus collection apparatus of this application.
The artificial intelligence-based corpus collection apparatus includes: a configuration item information acquisition module 51, a video data download module 52, an audio-subtitle processing module 53, an audio segmentation module 54, an audio-subtitle block association module 55, and a filtering module 56. The configuration item information acquisition module 51 is configured to obtain configuration item information entered by the user, where the configuration item information includes a target video keyword and a video website, the video website being the URL or the name of a video website. The video data download module 52 is configured to download, from the video website, the video data of the target video obtained by searching for the target video keyword, the video data including a video file and an SRT subtitle file. The audio-subtitle processing module 53 is configured to separate the audio file from the video file and split the subtitle text content parsed from the SRT subtitle file into subtitle blocks. The audio segmentation module 54 is configured to cut the audio file according to the segment time of each subtitle block to obtain segmented audio. The audio-subtitle block association module 55 is configured to establish associations between the audio segments and the subtitle blocks. The filtering module 56 is configured to classify and filter the associated audio segments and subtitle blocks according to the preset filtering keywords and store them together as the target corpus.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of the audio-subtitle processing module in FIG. 5. Specifically, in the embodiments of this application, the audio-subtitle processing module 53 includes: a subtitle splitting module 531, configured to parse the SRT subtitle file to obtain the subtitle text content and to divide the subtitle text content into blocks according to the display period and sentence-break punctuation to obtain subtitle blocks; and an audio-video separation module 532, configured to separate the audio file from the video file.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a first embodiment of the filtering module in FIG. 5. Specifically, in some embodiments, the filtering module 56 includes: a keyword matching module 561, configured to analyze whether each subtitle block contains text matching a preset filtering keyword; and a first storage module 562, configured to store each subtitle block containing matching text, together with the audio segment associated with that subtitle block, in a designated first location.
Further, in other embodiments, the filtering module 56 includes, in addition to the keyword matching module 561 and the first storage module 562: a voice state parameter judgment module 563, configured to determine whether each voice state parameter of the audio segments stored in the first location lies within the preset standard interval, where the voice state parameters are included in the preset filtering configuration item information and are used to assist in analyzing the emotion of the segmented audio; and a second storage module 564, configured to select the audio segments whose voice state parameters all lie within the preset standard intervals and store them, together with the subtitle blocks associated with those segments, in a designated second location.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a second embodiment of the filtering module in FIG. 5. Specifically, in other embodiments, the filtering module 56 includes, in addition to the keyword matching module 561 and the first storage module 562: a voice state parameter score calculation module 565, configured to calculate the score of each voice state parameter of each audio segment stored in the first location; a total score calculation and judgment module 566, configured to sum the scores of all voice state parameters of the same audio segment and determine whether the total score reaches the preset threshold; and a third storage module 567, configured to store the audio segments whose total score reaches the preset threshold, together with the subtitle blocks associated with those segments, in a designated third location.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of the voice state parameter score calculation module in FIG. 8. Specifically, the voice state parameter score calculation module 565 includes: a range analysis module 5651, configured to obtain corpus samples labeled with the target emotion category and perform statistical analysis to obtain, for each voice state parameter, the range within which the proportion of samples under the target emotion category exceeds the preset value. The corpus samples may be samples collected manually and judged to match the desired target emotion category, or existing samples collected by other means. In this embodiment, taking frequency as an example, a library of corpus samples labeled with the target emotion category (for example, anger) is obtained, the frequency value of each sample is measured, and a normal probability distribution of the frequencies is plotted; if samples with frequencies in the range 50 to 70 Hz account for a proportion of all corpus samples greater than the preset value (for example, 97%), the range of the frequency parameter under the target emotion category is obtained, and the same method yields the corresponding range for each voice state parameter. A standard value setting module 5652 is configured to select a value within the range as the preset standard value of the voice state parameter, where the frequency standard value is denoted W_frequency, and W_volume, W_amplitude, W_speech_rate, and W_intonation denote the preset standard values of the other parameters; the value selected as the preset standard value may be the median or any value within the range, and in this embodiment the median frequency of 60 Hz is selected as the preset standard value of frequency. A test value module 5653 is configured to measure each voice state parameter value of each audio segment stored in the first location. A score calculation module 5654 is configured to calculate, based on the preset standard values of the voice state parameters, the measured voice state parameter values, and the received weight values, the score of each voice state parameter by the following formula:
M_i = 100 * S_i * (X_i / W_i), where M_i is the score of each voice state parameter, S_i is the weight value of each voice state parameter, X_i is the measured voice state parameter value, W_i is the preset standard value of the voice state parameter, and i denotes the voice state parameter, which may specifically be volume, frequency, amplitude, speech rate, or intonation.
An embodiment of this application discloses a computer device. Referring to FIG. 10, FIG. 10 is a block diagram of the basic structure of a computer device 100 in an embodiment of this application. As shown in FIG. 10, the computer device 100 includes a memory 101, a processor 102, and a network interface 103 communicatively connected to one another through a system bus. It should be noted that FIG. 10 shows only a computer device 100 with components 101 to 103; it should be understood, however, that implementing all of the illustrated components is not required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-control device.
The memory 101 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 101 may be an internal storage unit of the computer device 100, such as the hard disk or main memory of the computer device 100. In other embodiments, the memory 101 may be an external storage device of the computer device 100, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 100. Of course, the memory 101 may also include both an internal storage unit of the computer device 100 and an external storage device. In this embodiment, the memory 101 is generally used to store the operating system and the various application software installed on the computer device 100, such as the artificial intelligence-based corpus collection method described above. In addition, the memory 101 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 102 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 102 is generally used to control the overall operation of the computer device 100. In this embodiment, the processor 102 is configured to run the computer-readable instructions or process the data stored in the memory 101, for example to run the computer-readable instructions of the artificial intelligence-based corpus collection method described above.
The network interface 103 may include a wireless or wired network interface and is generally used to establish communication connections between the computer device 100 and other electronic devices.
This application further provides another implementation: a non-volatile readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the steps of any of the artificial intelligence-based corpus collection methods described above.
Finally, it should be noted that the embodiments described above are clearly only some, not all, of the embodiments of this application; the drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; rather, these embodiments are provided so that the disclosure of this application will be thorough and complete. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (21)

1. An artificial intelligence-based corpus collection method, comprising:
    obtaining configuration item information entered by a user, the configuration item information comprising a target video keyword and a video website, the video website being a URL of a video website or a name of a video website;
    downloading, from the video website, video data of a target video obtained by searching for the target video keyword, the video data comprising a video file and an SRT subtitle file;
    separating an audio file from the video file, and splitting subtitle text content parsed from the SRT subtitle file into subtitle blocks;
    cutting the audio file according to a segment time of each of the subtitle blocks to obtain segmented audio;
    establishing an association between the segmented audio and the subtitle blocks; and
    classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords, and storing them together as a target corpus.
2. The artificial intelligence-based corpus collection method according to claim 1, wherein the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus comprises:
    analyzing whether each subtitle block contains text matching a preset filtering keyword; and
    storing a subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in a designated first location.
3. The artificial intelligence-based corpus collection method according to claim 2, wherein after the step of storing the subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in the designated first location, the method further comprises:
    determining whether each voice state parameter of each audio segment stored in the first location lies within a preset standard interval; and
    selecting the audio segments whose voice state parameters all lie within the preset standard intervals and storing them, together with the subtitle blocks associated with those audio segments, in a designated second location.
4. The artificial intelligence-based corpus collection method according to claim 3, wherein the method of setting the preset standard interval comprises:
    obtaining corpus samples labeled with a target emotion category and performing statistical analysis to obtain, for each voice state parameter, a range of values within which the proportion of samples under the target emotion category exceeds a preset value; and
    extracting, from the range, an interval contained within the range as the preset standard interval.
5. The artificial intelligence-based corpus collection method according to claim 2, wherein after the step of storing the subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in the designated first location, the method further comprises:
    calculating a score for each voice state parameter of each audio segment stored in the first location;
    summing the scores of all voice state parameters of the same audio segment and determining whether the total score reaches a preset threshold; and
    storing the audio segments whose total score reaches the preset threshold, together with the subtitle blocks associated with those audio segments, in a designated third location.
6. The artificial intelligence-based corpus collection method according to claim 5, wherein the step of calculating the score of each voice state parameter of each audio segment stored in the first location comprises:
    obtaining corpus samples labeled with a target emotion category and performing statistical analysis to obtain, for each voice state parameter, a range of values within which the proportion of samples under the target emotion category exceeds a preset value;
    selecting a value within the range as a preset standard value of the voice state parameter;
    measuring each voice state parameter value of each audio segment stored in the first location; and
    calculating, based on the preset standard values of the voice state parameters, the measured voice state parameter values, and received weight values, the score of each voice state parameter by the following formula:
    M_i = 100 * S_i * (X_i / W_i), where M_i is the score of each voice state parameter, S_i is the weight value of each voice state parameter, X_i is the measured voice state parameter value, W_i is the preset standard value of the voice state parameter, and i denotes the voice state parameter.
7. The artificial intelligence-based corpus collection method according to claim 4 or 6, wherein the step of obtaining corpus samples labeled with a target emotion category and performing statistical analysis comprises:
    obtaining a library of corpus samples labeled with the target emotion category;
    measuring a frequency value of each corpus sample in the library to obtain a normal probability distribution of the frequency values; and
    performing the statistical analysis based on the normal probability distribution.
8. The artificial intelligence-based corpus collection method according to any one of claims 1 to 7, wherein the step of splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks comprises:
    parsing the SRT subtitle file to obtain the subtitle text content; and
    dividing the subtitle text content into blocks according to the display period and sentence-break punctuation to obtain the subtitle blocks.
9. An artificial intelligence-based corpus collection apparatus, comprising:
    a configuration item information acquisition module, configured to obtain configuration item information entered by a user, the configuration item information comprising a target video keyword and a video website, the video website being a URL of a video website or a name of a video website;
    a video data download module, configured to download, from the video website, video data of a target video obtained by searching for the target video keyword, the video data comprising a video file and an SRT subtitle file;
    an audio-subtitle processing module, configured to separate an audio file from the video file and split subtitle text content parsed from the SRT subtitle file into subtitle blocks;
    an audio segmentation module, configured to cut the audio file according to a segment time of each subtitle block to obtain segmented audio;
    an audio-subtitle block association module, configured to establish an association between the segmented audio and the subtitle blocks; and
    a filtering module, configured to classify and filter the associated segmented audio and subtitle blocks according to preset filtering keywords and store them together as a target corpus.
10. The artificial intelligence-based corpus collection apparatus according to claim 9, wherein the audio-subtitle processing module comprises:
    a subtitle splitting module, configured to parse the SRT subtitle file to obtain the subtitle text content and to divide the subtitle text content into blocks according to the display period and sentence-break punctuation to obtain the subtitle blocks; and an audio-video separation module, configured to separate the audio file from the video file.
11. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the following artificial intelligence-based corpus collection method:
    obtaining configuration item information entered by a user, the configuration item information comprising a target video keyword and a video website, the video website being a URL of a video website or a name of a video website;
    downloading, from the video website, video data of a target video obtained by searching for the target video keyword, the video data comprising a video file and an SRT subtitle file;
    separating an audio file from the video file, and splitting subtitle text content parsed from the SRT subtitle file into subtitle blocks;
    cutting the audio file according to a segment time of each of the subtitle blocks to obtain segmented audio;
    establishing an association between the segmented audio and the subtitle blocks; and
    classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords, and storing them together as a target corpus.
12. The computer device according to claim 11, wherein the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and storing them together as the target corpus comprises:
    analyzing whether each subtitle block contains text matching a preset filtering keyword; and
    storing a subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in a designated first location.
13. The computer device according to claim 12, wherein after the step of storing the subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in the designated first location, the method further comprises:
    determining whether each voice state parameter of each audio segment stored in the first location lies within a preset standard interval; and
    selecting the audio segments whose voice state parameters all lie within the preset standard intervals and storing them, together with the subtitle blocks associated with those audio segments, in a designated second location.
14. The computer device according to claim 13, wherein the method of setting the preset standard interval comprises:
    obtaining corpus samples labeled with a target emotion category and performing statistical analysis to obtain, for each voice state parameter, a range of values within which the proportion of samples under the target emotion category exceeds a preset value; and
    extracting, from the range, an interval contained within the range as the preset standard interval.
15. The computer device according to claim 12, wherein after the step of storing the subtitle block containing matching text, together with the segmented audio associated with the subtitle block, in the designated first location, the method further comprises:
    calculating a score for each voice state parameter of each audio segment stored in the first location;
    summing the scores of all voice state parameters of the same audio segment and determining whether the total score reaches a preset threshold; and
    storing the audio segments whose total score reaches the preset threshold, together with the subtitle blocks associated with those audio segments, in a designated third location.
16. One or more non-volatile readable storage media storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the steps of the following artificial intelligence-based corpus collection method:
    obtaining configuration item information input by a user, the configuration item information including a target video keyword and a video website, the video website being the URL or the name of a video website;
    downloading, from the video website, the video data of a target video obtained by retrieving the target video keyword, the video data including a video file and an SRT subtitle file;
    separating an audio file from the video file, and splitting the subtitle text content parsed from the SRT subtitle file into subtitle blocks;
    segmenting the audio file according to the segment time of each subtitle block to obtain segmented audio;
    establishing an association between the segmented audio and the subtitle blocks; and
    classifying and filtering the associated segmented audio and subtitle blocks according to preset filtering keywords, and then storing them together as the target corpus.
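For orientation, a condensed Python sketch of the pipeline in this claim, using pysrt for SRT parsing and the ffmpeg command line for audio separation and cutting. The download step is omitted, and every file name and field name here is illustrative rather than taken from the disclosure.

    import subprocess
    import pysrt  # pip install pysrt

    def extract_audio(video_file, audio_file="audio.wav"):
        # Separate the audio track from the downloaded video file.
        subprocess.run(["ffmpeg", "-y", "-i", video_file, "-vn", audio_file],
                       check=True)
        return audio_file

    def segment_by_subtitles(audio_file, srt_file):
        # Cut the audio at each subtitle block's time span and keep the
        # audio-to-subtitle association as a list of dicts.
        pairs = []
        for idx, sub in enumerate(pysrt.open(srt_file)):
            start = sub.start.ordinal / 1000.0  # milliseconds -> seconds
            end = sub.end.ordinal / 1000.0
            seg = f"segment_{idx:04d}.wav"
            subprocess.run(["ffmpeg", "-y", "-i", audio_file,
                            "-ss", str(start), "-to", str(end), seg],
                           check=True)
            pairs.append({"audio": seg, "subtitle": sub.text})
        return pairs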
17. The non-volatile readable storage medium according to claim 16, wherein the step of classifying and filtering the associated segmented audio and subtitle blocks according to the preset filtering keywords and then storing them together as the target corpus specifically comprises:
    analyzing whether each subtitle block contains text that matches the preset filtering keywords; and
    storing each subtitle block that contains matching text, together with the segmented audio associated with that subtitle block, in a designated first location.
18. The non-volatile readable storage medium according to claim 17, wherein, after the step of storing the subtitle block containing the matching text together with the segmented audio associated with the subtitle block in the designated first location, the method further comprises:
    determining whether each voice state parameter of each segmented audio stored in the first location is within a preset standard interval; and
    selecting the segmented audio whose voice state parameters are all within the preset standard interval, and storing it, together with the subtitle block associated with that segmented audio, in a designated second location.
19. The non-volatile readable storage medium according to claim 18, wherein the method for setting the preset standard interval specifically comprises:
    obtaining corpus samples marked with a target emotion category and performing statistical analysis on them, to obtain, for each voice state parameter under the target emotion category, a range within which that parameter occurs with a probability greater than a preset value; and
    extracting an interval contained within that range as the preset standard interval.
20. The non-volatile readable storage medium according to claim 17, wherein, after the step of storing the subtitle block containing the matching text together with the segmented audio associated with the subtitle block in the designated first location, the method further comprises:
    calculating a score for each voice state parameter of each segmented audio stored in the first location;
    summing the scores of all voice state parameters of the same segmented audio, and confirming whether the total score reaches a preset threshold; and
    storing the segmented audio whose total score reaches the preset threshold, together with the subtitle block associated with that segmented audio, in a designated third location.
21. [Corrected according to Rule 26 on 25.11.2019]
    The non-volatile readable storage medium according to claim 19, wherein the step of calculating the score of each voice state parameter of each segmented audio stored in the first location specifically comprises:
    obtaining corpus samples marked with the target emotion category and performing statistical analysis on them, to obtain, for each voice state parameter under the target emotion category, a range within which that parameter occurs with a probability greater than the preset value;
    selecting a value within that range as a preset standard value of the voice state parameter;
    measuring the value of each voice state parameter of each segmented audio stored in the first location; and
    calculating the score of each voice state parameter from the preset standard value of the voice state parameter, the measured voice state parameter value, and a received weight value, according to the following formula:
    M_i = 100 * S_i * (X_i / W_i), where M_i is the score of voice state parameter i, S_i is the weight value of that parameter, X_i is its measured value, and W_i is its preset standard value.
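The formula itself is simple to apply; a sketch with invented numbers (the weight values S_i are received as input in the claim, so the one below is a placeholder):

    def parameter_score(weight, measured, standard):
        # M_i = 100 * S_i * (X_i / W_i)
        return 100.0 * weight * (measured / standard)

    # Hypothetical example: weight 0.4, measured pitch 250 Hz against a
    # preset standard value of 200 Hz -> 100 * 0.4 * 1.25 = 50.0
    score = parameter_score(0.4, 250.0, 200.0)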
PCT/CN2019/117261 2019-01-28 2019-11-11 Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium WO2020155750A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910081793.7 2019-01-28
CN201910081793.7A CN110008378B (en) 2019-01-28 2019-01-28 Corpus collection method, device, equipment and storage medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
WO2020155750A1 (en)

Family

ID=67165610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117261 WO2020155750A1 (en) 2019-01-28 2019-11-11 Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110008378B (en)
WO (1) WO2020155750A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022228235A1 (en) * 2021-04-29 2022-11-03 华为云计算技术有限公司 Method and apparatus for generating video corpus, and related device

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008378B (en) * 2019-01-28 2024-03-19 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium
CN112749299A (en) * 2019-10-31 2021-05-04 北京国双科技有限公司 Method and device for determining video type, electronic equipment and readable storage medium
CN111091811B (en) * 2019-11-22 2022-04-22 珠海格力电器股份有限公司 Method and device for processing voice training data and storage medium
CN111209461A (en) * 2019-12-30 2020-05-29 成都理工大学 Bilingual corpus collection system based on public identification words
CN111445902B (en) * 2020-03-27 2023-05-30 北京字节跳动网络技术有限公司 Data collection method, device, storage medium and electronic equipment
CN111629267B (en) * 2020-04-30 2023-06-09 腾讯科技(深圳)有限公司 Audio labeling method, device, equipment and computer readable storage medium
CN112818680B (en) * 2020-07-10 2023-08-01 腾讯科技(深圳)有限公司 Corpus processing method and device, electronic equipment and computer readable storage medium
CN114996506A (en) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and device, electronic equipment and computer-readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070253678A1 (en) * 2006-05-01 2007-11-01 Sarukkai Ramesh R Systems and methods for indexing and searching digital video content
CN103324685A (en) * 2013-06-03 2013-09-25 大连理工大学 Search method for video fragments of Japanese online video corpora
CN104978961A (en) * 2015-05-25 2015-10-14 腾讯科技(深圳)有限公司 Audio processing method, device and terminal
CN105047203A (en) * 2015-05-25 2015-11-11 腾讯科技(深圳)有限公司 Audio processing method, device and terminal
CN108122552A (en) * 2017-12-15 2018-06-05 上海智臻智能网络科技股份有限公司 Voice mood recognition methods and device
CN108268539A (en) * 2016-12-31 2018-07-10 上海交通大学 Video matching system based on text analyzing
CN110008378A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070061303A1 (en) * 2005-09-14 2007-03-15 Jorey Ramer Mobile search result clustering
CN101021855B (en) * 2006-10-11 2010-04-07 北京新岸线网络技术有限公司 Video searching system based on content
KR102018295B1 (en) * 2017-06-14 2019-09-05 주식회사 핀인사이트 Apparatus, method and computer-readable medium for searching and providing sectional video

Also Published As

Publication number Publication date
CN110008378B (en) 2024-03-19
CN110008378A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
WO2020155750A1 (en) Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium
WO2018072071A1 (en) Knowledge map building system and method
US20190104342A1 (en) Cognitive digital video filtering based on user preferences
CN111814770B (en) Content keyword extraction method of news video, terminal device and medium
WO2021098648A1 (en) Text recommendation method, apparatus and device, and medium
CN111274442B (en) Method for determining video tag, server and storage medium
US11151191B2 (en) Video content segmentation and search
CN105224581B (en) The method and apparatus of picture are presented when playing music
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN107229731B (en) Method and apparatus for classifying data
AU2017216520A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
JP2004078512A (en) Document management method and document management device
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN112929746B (en) Video generation method and device, storage medium and electronic equipment
JP2008282407A (en) Information processing apparatus
CN114297439A (en) Method, system, device and storage medium for determining short video label
US11756301B2 (en) System and method for automatically detecting and marking logical scenes in media content
JP2007012013A (en) Video data management device and method, and program
CN111090813A (en) Content processing method and device and computer readable storage medium
US20200349177A1 (en) Method, system, and computer program product for retrieving relevant documents
JP7395377B2 (en) Content search methods, devices, equipment, and storage media
CN114625918A (en) Video recommendation method, device, equipment, storage medium and program product
CN114254158A (en) Video generation method and device, and neural network training method and device
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN107133644B (en) Digital library's content analysis system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19913960

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19913960

Country of ref document: EP

Kind code of ref document: A1
