WO2022091304A1

WO2022091304A1 - Categorization apparatus, control device, categorization method, control method and computer readable medium

Info

Publication number: WO2022091304A1
Application number: PCT/JP2020/040660
Authority: WO
Inventors: Alexander Viehweider
Original assignee: Nec Corporation
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-05-05
Also published as: JP7485217B2; JP2023546189A

Abstract

An object of the present disclosure is to provide a categorization apparatus capable of providing support for humans. The categorization apparatus includes a generation means (11) for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region; a categorization means (12) for categorizing the partial video data generated by the generation means; and a modification means (13) for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.

Description

CATEGORIZATION APPARATUS, CONTROL DEVICE, CATEGORIZATION METHOD, CONTROL METHOD AND COMPUTER READABLE MEDIUM

　　The present disclosure relates to a categorization apparatus, a control device, a categorization method, a control method and a non-transitory computer readable medium.

　　The technology of image analysis and video analysis has been developed rapidly.

　　For example, PTL 1 discloses a display control device, which is capable of making a thumbnail of a scene. Specifically, the display control device makes clustering results, wherein each frame of a content is subjected to clustering, to display thumbnails. The scene classifying unit 612 of the display control device classifies a frame belonging to the cluster of interest into a scene with a frame group of one or more frames. The thumbnail creating unit 613 of the display control device creates the thumbnail of each scene based on the scene information from the scene classifying unit 612.

PTL 1: JP5533861B2

　　In recent years, a technology for supporting human activity by a machine (For example, computers, support robots, etc.) has been developed. In such techniques, it is important for machines to detect and categorize human motion sequences in order to achieve the support desired by humans.

　　An object of the present disclosure is to provide a categorization apparatus, a control device, a categorization method, a control method and a non-transitory computer readable medium capable of providing human support (understood as support for humans).

　　In a first example aspect, a categorization apparatus includes: a generation means for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region; a categorization means for categorizing the partial video data generated by the generation means; and a modification means for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.

　　In a second example aspect, a control device includes: a recognition means for recognizing video data containing an operation and, thereby, determining the operation; and a controller for determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.

　　In a third example aspect, a categorization method includes: determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region; categorizing the partial video data; and modifying the predetermined algorithm based on evaluation of the categorization.

　　In a fourth example aspect, a control method includes: recognizing video data containing an operation and, thereby, determining the operation; and determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.

　　In a fifth example aspect, a non-transitory computer readable medium storing a program to causes a computer to execute: determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region; categorizing the partial video data; and modifying the predetermined algorithm based on evaluation of the categorization.

　　In a sixth example aspect, a non-transitory computer readable medium storing a program for causing a computer to execute: recognizing video data containing an operation and, thereby, determining the operation; and determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.

　　According to the present disclosure, it is possible to provide a categorization apparatus, a control device, a categorization method, a control method and a non-transitory computer readable medium capable of providing support for humans.

Fig. 1 is a block diagram of a categorization apparatus according to a first example embodiment. Fig. 2 is a flowchart illustrating a method for categorizing video data according to the first example embodiment. Fig. 3 is a block diagram of a control device according to a second example embodiment. Fig. 4 is a flowchart illustrating a method for controlling a machine according to the second example embodiment. Fig. 5 is a block diagram of a categorization system according to a third example embodiment. Fig. 6 is a block diagram of a generation unit according to the third example embodiment. Fig. 7 is a graph showing an example of an intensity signal of the video data according to the third example embodiment. Fig. 8A is a picture showing examples of human motions of each subsequence according to the third example embodiment. Fig. 8B is a table showing examples of categories and category labels corresponding to subsequences according to the third example embodiment. Fig. 9 is a graph showing an example of an intensity signal of the video data according to the third example embodiment. Fig. 10 is a table showing examples of categories and category labels corresponding to subsequences according to the third example embodiment. Fig. 11 is a schematic view of a feedback process according to the third example embodiment. Fig. 12 is a graph showing an example of transition of the number of reasonable categorization solutions according to the third example embodiment. Fig. 13 is a block diagram of an intention detection system according to a fourth example embodiment. Fig. 14 is a block diagram of a machine including the intention detection system according to a fifth example embodiment. Fig. 15 is a picture showing an example of a picking robot including the intention detection system according to the fifth example embodiment. Fig. 16A is a picture showing one example process of the picking robot instructed by human gesture according to the fifth example embodiment. Fig. 16B is a picture showing another example process of the picking robot instructed by human gesture according to the fifth example embodiment. Fig. 17 is a configuration diagram of an information processing apparatus according to example embodiments.

　　(First Example Embodiment)
　　A first example embodiment of the disclosure is explained below with reference to the drawings. Referring to Fig. 1, a categorization apparatus 10 includes a generation unit 11, a categorization unit 12 and a modification unit 13. The categorization apparatus 10 may be applied to various computers or machines capable of dealing with video data. For example, the categorization apparatus 10 may be installed as a personal computer, a video recorder, a robot, a machine, a television (TV) set, a cell phone, or the like.

　　The generation unit 11 determines a certain time region of the video data based on predetermined algorithm, and generates partial video data from which the video data is extracted in the certain time region. The video data has a certain length of time and the certain time region is within the certain length of time. The video data may be a sequence of image data - that is, it may comprise a number of frames. The generation unit 11 may analyze the contents of the video data by using the predetermined algorithm and set the certain time region. The video data may be stored in a memory in the categorization apparatus 10 or may be input to the generation unit 11 from outside of the categorization apparatus 10. Furthermore, the predetermined algorithm may be stored in a memory in the categorization apparatus 10.

　　The categorization unit 12 categorizes the partial video data generated by the generation unit 11. The categorization can be done by using numbers, texts or the like. The categorization may be related to human motions, such as gesture, specific scenes in a TV program or movie, but not limited to these.

　　The modification unit 13 modifies the predetermined algorithm based on evaluation of the categorization executed by the categorization unit 12. The evaluation may be processed by a component in the categorization apparatus 10, but it may be also processed by an apparatus outside the categorization apparatus 10.

　　Fig. 2 is a flowchart showing an example of processing executed by the categorization apparatus 10 according to the first example embodiment. The processing executed by the categorization apparatus 10 will be described below.

　　First, the generation unit 11 determines the certain time region of the video data based on predetermined algorithm (Step S11). Next, it generates partial video data from which the video data is extracted in the certain time region (Step S12). This partial video data may be referred to one scene and show a kind of human motion, but it is not limited to this.

　　Then, the categorization unit 12 categorizes the partial video data generated by the generation unit 11 (Step S13). Through this process, the categorization unit 12 may classify various partial video data into a plurality of categories.

　　After that, the modification unit 13 modifies the predetermined algorithm if necessary based on evaluation of the categorization executed by the categorization unit 12 (Step S14). As a result of the modifying the predetermined algorithm, the certain time region can be changed along with the result of the evaluation. Therefore, the generation unit 11 can generate the partial video data so that the partial video data contains a scene to be categorized more precisely. For example, if the categorization is aimed to categorize human motion subsequences, the categorization apparatus 10 can determine suitable time region of the partial video data for denoting one human motion subsequence. As a consequence, the categorization unit 12 can categorize the partial video data more accurately, because the partial video data can show exact one human motion subsequence.

　　(Second Example Embodiment)
　　A second example embodiment of the disclosure is explained below with reference to the drawings. Referring to Fig. 3, a control device 14 includes a recognition unit 15 and a controller 16. The control device 14 may be applied to a device installed in various computers or machines, for example, robots for supporting human.

　　The recognition unit 15 recognizes video data containing an operation and, thereby, determines the operation. The video data may denote human motion, and the human motion can also be the operation to a certain object. For example, the operations include an operation that the certain object is grasped, an operation that the certain object is put down, and the like. This gesture can instruct a robot to perform some process and can be implicit and explicit. The video data can be categorized as shown in the First Example Embodiment.

　　The controller 16 determines a motion of a machine depending on the determined operation and controls the machine in accordance with the determined operation. The machine may be the one including the control device 14, however, it is not limited to this.

　　Fig. 4 is a flowchart showing an example of processing executed by the control device 14 according to the second example embodiment. The processing executed by the control device 14 will be described below.

　　First, the recognition unit 15 recognizes video data containing an operation (Step S15). As mentioned before, the operation may be a human motion. Next, by recognizing the video data, it determines the operation (Step S16).

　　Then, the controller 16 determines a motion of the machine depending on the determined operation (Step S17). After that, the controller 16 controls the machine in accordance with the determined operation (Step S18). For example, if a user performs the operation, the recognition unit 15 can understand what the user wants the machine to do and the controller 16 can control the machine as instructed by the user and some other input. Specifically, by this process, the control device 14 can control the machine via recognizing intention of human.

　　The control device 14 according to the second example embodiment can achieve reduction of, for example, a function of system integration in machines such as a robot, a computer, and or the like.

　　The recognition unit 15 is achievable by functions of the categorization unit 12 and/or the modification unit 13 in Fig. 1. Furthermore, the recognition unit 15 may be achievable by functions of a preprocessor unit 21, a generation unit 22, a categorization unit 23, a mapping unit 24, and/or a modification unit 25 in Fig. 5. The recognition unit 15 may be achievable by functions of a computation unit 26, a signal-analysis unit 27, a determination unit 28, and/or a subsequence generation unit 29 in Fig. 6. In addition, the recognition unit 15 may be achievable by functions of a human-object analysis unit 31 and/or an intention detection unit 32 in Fig. 13. The recognition unit 15 may be achievable by a pattern recognition algorithm and/or an image recognition algorithm in a field of computer vision. Further, the controller 16 is achievable by functions of a signal generator 41, and/or an optimizer controller 42 in Fig. 14. The details of Figs. 5, 6, 13 and 14 will be described later.

　　(Third Example Embodiment)
　　A third example embodiment of the disclosure is explained below with reference to the drawings. The third example embodiment is a specific example of the first example embodiment.

　　First, the configuration and processing of the categorization system according to the third example embodiment will be described. Referring to Fig. 5, the categorization system 20 includes a preprocessor unit 21, a generation unit 22, a categorization unit 23, a mapping unit 24, a modification unit 25 and a database (DB). The categorization system 20 may be provided, for example, as a module of a machine or a robot. The categorization system 20 may receive raw video data from a sensory input or an imaging section (not shown in Fig. 5), for example, a video camera. The imaging section can catch at certain intervals a frame of a person.

　　The preprocessor unit 21 receives the raw video data and preprocesses (i.e. preliminary processes) it. Specifically, the preprocessor unit 21 reduces information contained in the raw data and generate the preprocessed video data (hereinafter simply referred to as video data) containing information related to categorization, which is done by the categorization unit 23. For example, the preprocessor unit 21 may reduce an irregularly sampled sequence of higher definition frames to frames with a low number of data points containing relevant information. The relevant information may include characteristic body points of the person to be photographed. Also, the relevant information may include the relationship the person with an object, which is operated by the person or is located nearby the person.

　　The preprocessor unit 21 outputs the video data to the generation unit 22. The preprocessor unit 21 may be realized by the combination of preprocessor software and a processor in the categorization system 20.

　　The generation unit 22 receives the video data from the preprocessor unit 21 and generates subsequences (the partial video data) from which the video data is extracted in the certain time region. To that end, the generation unit 22 determines the certain time region of the video data based on predetermined algorithm. In another words, the generation unit 22 may perform as a division generation unit, which divides the video into several subsequences.

　　Fig. 6 is a block diagram of the generation unit 22. The generation unit 22 includes a computation unit 26, signal-analysis unit 27, a determination unit 28 and a subsequence-generation unit 29. The detailed processes in the generation unit 22 will be explained.

　　The computation unit 26 computes an intensity signal of the video data to determine the length of the subsequences and the position of the subsequences in the video data (i.e. the certain time region), while the intensity signal denotes motion of the person. In detail, the intensity signal summarizes the dynamic movements of the acting person by a scalar signal and by determining characteristic points of this signal. The computation unit 26 uses the predetermined algorithm, which may be expressed as formula(s) and/or rule(s), to calculate the intensity signal. The computation unit 26 outputs the intensity signal to the signal-analysis unit 27.

　　The signal-analysis unit 27 analyzes the intensity signal and specify candidate point(s) in the intensity. The candidate point(s) are candidates of characteristic point(s) for the determining the length of the subsequence(s) and the position of the subsequence(s) in the video data.

　　The determination unit 28 determines characteristic point(s) from candidate point(s) specified by the signal-analysis unit 27. The signal-analysis unit 27 and the determination unit 28 uses rule base included in the predetermined algorithm to do the above-mentioned processes. This is how characteristic point(s) in the video data are derived.

　　The subsequence-generation unit 29 utilizes the characteristic point(s) determined by the determination unit 28 to decide the length of the subsequence(s) and the position of the subsequence(s) in the video data. The subsequence-generation unit 29 uses generation law included in the predetermined algorithm to decide these factors. The subsequence-generation unit 29 generates subsequence(s) from the sequences of frames (i.e. the video data).

　　To summarize, the generation unit 22 can generate a series of subsequences from the sequence of frames of the video data based on the predetermined algorithm, which includes the generation law and the appropriate rule base. The data of each subsequence is used as a motion subsequence candidate by the categorization unit 23. The generation unit 22 outputs the generated subsequences to the categorization unit 23.

　　It should also be noted that the predetermined algorithm can be modified by feedback from the modification unit 25. If the predetermined algorithm is modified, the computation unit 26 changes the way it computes the intensity signal and/or at least one of the signal-analysis unit 27 and the determination unit 28 changes the way it determines the characteristic point(s). Therefore, the length of the subsequence(s) and/or the position of the subsequence(s) in the video data are corrected to obtain categorization more accurately. This modification process will be explained in detail below.

　　The categorization unit 23 receives the subsequences and categorize them (the partial video data) as human motions. The categorization unit 23 assigns the categorized subsequences to category numbers. Furthermore, the categorization unit 23 can assign the categorized subsequences to text labels by accessing the mapping unit 24 and/or the DB. The subsequences are categorized as cluster(s) of human motions. The categorization unit 23 outputs the subsequences with the category number and the text label for further process(es).

　　Furthermore, to consider single subsequence as a candidate to be categorized, the categorization unit 23 may derive one (or temporarily one or more) categorization solution. The categorization unit 23 perform this process regarding each of the subsequences and the categorization solution is required to categorize the subsequence for the categorization unit 23.

　　The DB functions as a library and stores the categorization solution generated by the categorization unit 23 and the category numbers. The categorization unit 23 can access the DB to use the categorization solution and the category numbers to categorize the subsequences.

　　The mapping unit 24 obtains textual information related to the categorization, such as documents, from databases and/ or the Internet. The mapping unit 24 further obtains textual information especially provided by a user of the categorization system 20. The mapping unit 24 processes the textual information and generates mapping to lexical description to be used to the categorization. The mapping unit 24 may include an input unit and/or network interface, besides a processor and a memory. The categorization unit 23 can access the mapping unit 24 to refer the mapping in order to improve accuracy of the determination of subsequences and categories. In other words, the categorization process done by the categorization unit 23 is assisted by mapping categories to the linguistic domain generated by the mapping unit 24. More specifically, the categorization unit 23 assigns text labels that can be understand by humans to the subsequences and describe the motion pattern as accurate as possible to already identified categories hitherto labeled with a category number. It also assists if the categorization unit 23 cannot determine a category using the text labels of neighboring subsequences. The main purpose to use textual information is to enhance the capabilities of categorization and also to understand the reasoning of the system should there be the need for additional tuning or correcting wrong results of the categorizer.

　　The modification unit 25 determines an evaluation value of a particular categorization solution. The evaluation value of the categorization solution may describe how suited the corresponding categorization is for subsequent processing step(s) after the categorization. An example of the subsequent processing step is an intention detection. The evaluation value may show how suited the corresponding categorization is for predicting actions or events after the human motion denoted by the corresponding subsequence.

　　The evaluation value of the obtained categorization solution can be judged by one or a plurality of indicators. First example of the indicators is how well the categorization solution categorizes (i.e. the categorization unit 23 categorizes) elements that are already known to be of the same category as being part of the same category. Second example of the indicators is an indicator that describes a deviation from the predetermined number of categories for a defined problem. In other words, this indicator shows how much the deviation from an assumed-to-be-known optimal number of categories is for the defined problem. For example, the more unsuitable the obtained categorization solution becomes, the more this indicator becomes. Third example of the indicators is an indicator that describes how well a whole system achieves an overall task by using the categorization solution, while the whole system includes the categorization system 20. This is one of the most important indicators and the most difficult one if it should be used for improving the categorization. An example of the whole system will be described later.

　　The modification unit 25 evaluates the categorization by using at least one of these indicators. However, indicators are not limited to these examples; they may have a variety of parameters to define the correctness or appropriateness of the categorization solution. When the evaluation value of the current categorization solution does not meet a predetermined criterion regarding the indicator(s) (for example, the current categorization solution is not sufficient for the considered task), the modification unit 25 gives appropriate instruction (feedback) to the generation unit 22 to change the predetermined algorithm. In specific, if the modification unit 25 judges that some categorization(s) are not appropriate considering the evaluation of the categorization(s), the modification unit 25 sends the instruction which instructs that a part of the predetermined algorithm corresponding to the categorization(s) should be modified. Based on the instruction, the predetermined algorithm is modified and modification(s) are done to at least one of the ways of computing by the computation unit 26, the way of the analyzation by the signal-analysis unit 27, the way of the determination by the determination unit 28 and the way of the generating of subsequence-generation unit 29. Consequently, the length of the subsequences and the position of the subsequences in the video data can be changed.

　　Next, referring to Figs. 7 to 8B, specific examples of human motion and processes executed by the categorization system 20 will be described. Fig. 7 shows an example intensity signal of the video data. Frame numbers 0 to k_c are shown as a time axis in Fig. 7, and there are two characteristic points of the frame; k_A and k_B. As Fig. 7 shows, at the characteristic points, the intensity of signal has inflection points regarding intensity d_k, in particular, the minima.

　　In the categorization system 20, the computation unit 26 derives the graph in Fig. 7. The signal-analysis unit 27 analyzes the graph, find the two characteristic points k_A and k_B, and regards the two points as the candidate points. Then, the determination unit 28 defines the two points k_A and k_Bas characteristic points. The subsequence-generation unit 29 utilizes two points k_A and k_B determined to decide the length of the subsequences and the position of the subsequences in the video data. In this example, the subsequence-generation unit 29 generates subsequences (1), (2) and (3). The subsequence (1) is set as the one from frame number 0 to k_A, the subsequence (2) is set as the one from frame number k_A to k_Band the subsequence (3) is set as the one from frame number k_B to k_c. As explained above, the subsequences are defined by the two characteristic points k_A and k_B.

　　Fig. 8A shows examples of human motions of each subsequence. As indicated in Fig. 8A, the subsequence (1) shows "raising left arm" of person P, the subsequence (2) shows "passing object" of the person P with the object O and the subsequence (3) shows "relaxing" of the person P. The characteristic body points of these human motions are represented by the intensity signal in Fig. 7.

　　Fig. 8B shows examples of categories and category labels corresponding to the subsequences (1) to (3). The category of the subsequence (1) is "mp31", the category of the subsequence (2) is "mp76" and the category of the subsequence (1) is "mp21". The categorization unit 23 sets these category numbers using the DB. Furthermore, the category label of the subsequence (1) is "raising left arm", the category of the subsequence (2) is "passing object" and the category of the subsequence (1) is "relaxing". The categorization unit 23 sets these category labels using the textual information generated by the mapping unit 24. In this way, the categorization system 20 defines the labels of subsequences.

　　Next, referring to Figs. 9 and 10, an example in which the categorization system 20 does not categorize the subsequences well is described. The graph in Fig. 9 is the same as the one in Fig. 7. However, due to the shortage of information which is clue of finding characteristic points, the categorization system 20 misjudges that false points k_A' and k_B' are characteristic points. As a consequence, the subsequence-generation unit 29 generates subsequences (1)', (2)' and (3)'. The subsequence (1)' is set as the one from frame number 0 to k_A', the subsequence (2)' is set as the one from frame number k_A' to k_B'and the subsequence (3)' is set as the one from frame number k_B' to k_c.

　　Fig. 10 shows examples of categories and category labels corresponding to the subsequences (1)' to (3)'. The categorization unit 23 succeeds to determine the categories and the category labels of the subsequences (1)' and (3)' correctly, however, it fails to determine the category of the subsequences (2)', therefore, the subsequences (2)' is not categorizable for the categorization unit 23. In this case textual reasoning can assist the categorization process and allowing for categorization even if would not be categorizable without this means.

　　Fig. 11 is a schematic view of the feedback process by the modification unit 25 in this situation. The modification unit 25 evaluates the categorization results by using the indicator(s) and sends the feedback to the generation unit 22. The feedback instructs that the predetermined algorithm should be corrected regarding the decision of the characteristic points. Receiving the feedback, the generation unit 22 uses a characteristic point re-evaluation algorithm and adjust the determining of the characteristic points as result of the re-evaluation. Consequently, the generation unit 22 moves the points k_A' and k_B' from their original points as Fig. 9 shows and sets the points at correct positions in Fig. 7.

　　For example, the modification unit 25 may process feature spaces including a pair of

features

1 and 2, and it may categorize the data points to several groups in a different manner. However, it should be noted that the feature spaces are not limited to two dimensions.

　　Fig. 12 shows an example of transition of the number of reasonable categorization solutions. The categorization unit 23 decides what categorization solutions are the reasonable categorization solutions. At the beginning of the time in Fig 12, the number of reasonable categorization solutions is one. The number becomes 2, 3, 2, 1, 2, 1 in a turn with progress at time. To summarize, the number may become temporarily more than one, however, the number converges to one along with the time course. It is preferable that the categorization unit 23 limits the number of categorization solutions to be used in the categorization unit 23 to one, because ambiguity should be decreased for categorizing in order to process processing(s) after the categorization.

　　To detect human motions, determining categories of the human motions may be done in related art. However, following problems may occur when dealing with automatically determining categories by making use of preprocessed data obtained by analysing frames of movies showing a human motion (e.g. carrying out a specific task).

　　A first problem is the problem of deriving meaningful categories describing subsequences by using a minimum of information or no information at all. The categories should make sense from a practical point of view meaning that the obtained categorization is useful for the overall purpose of the technical system. The problem occurs even if the correct length of the subsequence would be known exactly. A valid criterion describing the evaluation value of the categorization solution has to be established.

　　A second problem is the problem of finding valid subsequences that can be mapped to categories and improving subsequence determination over time. The problem occurs since with no information or a low amount of information, it is difficult to derive the length of the single subsequence by solely using the preprocessed data. Furthermore, there is no indication for generating the subsequences.

　　A third problem is the problem of using textual information such as documents obtained from databases, internet or especially provided by the user in order to improve determination of subsequences and categories.

　　The categorization system 20 can solve the aforementioned problems. The first problem is solved by setting the indicators to evaluate the categorization done by the categorization unit 23. In related art, there is an inherent difficulty in the evaluation if low amount of relevant information is available and the categorization system provides input for further processing systems (like intention detection system), since its impact cannot be estimated directly, i.e. how good can the local intention be detected with a certain motion pattern categorization system. However, owing to the introduction of the extended set of indicators, the categorization system 20 can evaluate the properties of the categorization.

　　Moreover, by evaluating of the categorization, the categorization system 20 can modify the predetermined algorithm to change the way of generating the subsequences (the way of choosing the candidate points), if necessary. In other words, based on the evaluation value of the obtained categorization solution, the computation of the subsequence length may be adapted, for example, by changing the way the intensity signal is computed and the characteristics points are determined (e.g. by a rule base). Accordingly, the categorization solution is modified corresponding to the modification by the categorization unit 23 and the modified categorization solution is stored in the DB.

　　The second problem is solved by computing an intensity signal in a certain, adaptive way, deriving characteristic points of this signal for determining the relevant length of the subsequences based on the predetermined adaptable algorithm.

　　The third problem is solved by introducing of the DB and the mapping unit 24 to the categorization system 20. These units enable the categorization system 20 to generate categories and category labels by using appropriate numbers and text information. Especially, the mapping unit 24 can generate the mapping information by obtaining information about human motions from databases and/or the Internet, the categorization unit 23 can utilize the mapping information to improve the accuracy of the categorization.

　　The categories can be learnt automatically by the categorization system 20, and even when new motion subsequences are executed, a new category of the new motion can be determined with a modest need for data.

　　As explained above, the categorization system 20 can modify the predetermined algorithm based on the evaluation of the categorization executed by the categorization unit 23. Therefore, the categorization system 20 can categorize the subsequences more accurately.

　　Moreover, the preprocessor unit 21 can reduce information contained in the raw video data and generate the video data containing information related to the categorization. As a result, the processes related to the categorization can be done with less processing time and the accuracy of the categorization can be increased.

　　Furthermore, the modification unit 25 can evaluate the categorization using at least one of following indicators: an indicator that describes how well the categorization means categorizes elements that are already known to be of the same category as being part of the same category, an indicator that describes a deviation from the predetermined number of categories for a defined problem, and an indicator that describes how well a system achieves an overall task, while the system includes the categorization apparatus. Therefore, the categorization system 20 can evaluate the categorization practically.

　　Further, the categorization unit 23 can categorize the subsequence (partial video data) as a kind of human motion. As a consequence, the categorization system 20 can be used for detecting human motions.

　　Especially, the generation unit 22 can compute the intensity signal of the video data to determine the certain time region, while the intensity signal denotes motion of a person. As the feature of human motions can be defined as a simple intensity signal, consequently, the generation unit 22 can grasp the feature of human motions easily.

　　Moreover, the categorization unit 23 can assign the categorized subsequences (partial video data) to the text label. Therefore, users of the categorization system 20 can recognize the results of the categorization with ease.

　　(Fourth Example Embodiment)
　　A fourth example embodiment of the disclosure is explained below with reference to the drawing.

　　Fig. 13 shows an intention detection system 30. The intention detection system 30 includes the units of the categorization system 20, a human-object analysis unit 31 and an intention detection unit 32. To summarize, the intention detection system 30 is a system coupled with an intention detection inference module. Since the processes of the units from the preprocessor unit 21 to the modification unit 25 are the same as the ones explained in the third example embodiment, the description thereof is omitted. The human-object analysis unit 31 and the intention detection unit 32 correspond to one example of the recognition unit 15 in the Second Example Embodiment.

　　The human-object analysis unit 31 analyzes the video data input by the preprocessor unit 21 and the subsequences generated by the generation unit 22 to detect various kinds of human parts in the subsequences. The human part to be detected is for example, a head, a right or left arm, a right or left leg or the like. Preferably the human-object analysis unit 31 can detect the parts to be used for gesture indicating instructions. The human-object analysis unit 31 outputs the result of the detection to the categorization unit 23. The categorization unit 23 utilizes the detection result to categorize the subsequences to improve the accuracy of the categorization.

　　The intention detection unit 32 receives the result of the categorization from the categorization unit 23 and utilizes it to detect intention of the person in the video data. In this disclosure, "intention" can represent operations to a certain object. The operations, for example, can include an operation that the certain object is grasped, an operation that the certain object is put down, and the like. If the intention detection system 30 is located in a factory, the intention detection unit 32 can detect the intention of a worker (e.g. "indicating wish to grasp a certain object", "indicating wish to be followed", "indicating wish to put down object"). Furthermore, "intention" can also represent instruction for machine movement. The machine movement, for example, can include traveling, operation of a part of the machine or stop of these operations. The intention detection unit 32 outputs the result of the intention detection. Examples of the output are the inferred person's activity and/or gesture regarding the subsequence to be analyzed. Further, the intention detection unit 32 may predict the next activity and/or gesture of the person and output the prediction.

　　In this case, the intention detection unit 32 can detect human intention by using the categorized subsequences (partial video data). Consequently, the intention detection system 30 can be applied to the supporting system for human activities in the various fields, e.g., industrial and/or medical fields.

　　(Fifth Example Embodiment)
　　A fifth example embodiment of the disclosure is explained below with reference to the drawings. This example embodiment describes a specific application of the intention detection system 30.

　　Fig. 14 shows a machine which includes the intention detection system 30. Specifically, the machine 40 includes the intention detection system 30, a sensor S, a signal generator 41 and an optimizer controller 42. Since the processes of the intention detection system 30 are the same as the ones explained in the fourth example embodiment, the description thereof is omitted. One example of the machine 40 is a robot.

　　The sensor S obtains the raw video data and input it to the preprocessor unit 21 in the intention detection system 30. The sensor S, for example, may be video sensor.

　　The signal generator 41 receives the output from the intention detection unit 32 in the intention detection system 30 and generates control signals to control movements of the machine 40 considering also the output of the intention detection unit 32.For example, the signal generator 41 can determines a motion of the machine 40 depending on the operation determined by the intention detection unit 32 and controls the machine 40 in accordance with the determined operation. The signal generator 41 may receive other input signals from other sensor(s) and/or parts of the machine as shown in Fig. 14, and it may also generate the control signals considering other input signals. The signal generator 41 performs as a controller of the machine 40. For example, if the machine can move on the ground, the signal generator 41 can perform as a trajectory planner and generate the control signal for the movement along with the planned trajectory. Further, the signal generator 41 can receive signals from parts of the machine 40 and generate a reference signal to control the parts. The signal generator 41 outputs the generated signals to the optimizer controller 42. The optimizer controller 42 receives the control signals and processes the control signals as optimizer. That is how the machine 40 plans and controls its movement.

　　Fig. 15 shows a specific application of the machine 40, a picking robot. The picking robot R includes the intention detection system 30 inside itself, an absorb mechanism AM and a storage space. The absorb mechanism AM absorbs items and the absorbed items are stored in the storage space corresponding to internal control of the picking robot R.

　　Figs. 16A and 16B show example processes of the picking robot R instructed by human gesture. Figs. 16A and 16B show the situation that, in a warehouse or a factory, a worker W wants to direct and give command to the picking robot R. The picking robot R can monitor the worker W and obtain the video data to recognize the worker's gesture. Through the processes explained in the third and fourth example embodiment, the picking robot R categorizes the worker's gesture and detects the intention of the worker based on the categorization. Using the detection result of the intention, the picking robot R can carry out desired operation. The picking robot R may memorize correspondence relation between the detected gesture (i.e. instructions) of the worker W and the operations to be performed by the picking robot R. Detecting gesture, the picking robot R may perform the desired operation based on the memorized correspondence relation.

　　For example, in Fig. 16A, the worker W stretches his or her right arm toward the shelf S. Fig 16A also shows that there are many different items on the shelf S. Before the gesture of the worker W, the picking robot R does not operate to collect items on the shelf S. However, when the worker W perform the gesture, the picking robot R categorizes this gesture of the worker W and determines that this gesture corresponds to a process of absorbing items on the shelf S. Then, the signal generator 41 in the picking robot R generates the control signals to move the picking robot R to the position nearby the shelf S and make the absorb mechanism AM absorb the items on the shelf S to collect them.

　　As another example, in Fig. 16B, the worker W moves his or her left arm from the right side to the left side in Fig 16B. The picking robot R categorizes this gesture of the worker W and determines that this gesture corresponds to a process of stopping its operation and leaving from the shelf S. Then, the signal generator 41 in the picking robot R generates the control signals to do these movements.

　　In related art, markers are often necessary for instructing for machines, although it may be bothersome to attaching the markers to people. However, this disclosure discloses the advanced machine learning system applicable to various machines and can provide "no marker solution". Therefore, the burden of attaching the markers to people can be avoided.

　　Further, the signal generator 41 (a controller) controls the movement of the machine 40 based on the human intention detected by the intention detection unit 32. Therefore, the machine 40 can support the worker's work.

　　It should be noted that the present invention is not limited to the above-described embodiment, and may be modified as appropriate without departing from the spirit of the invention. For example, instead of the modification unit 25, another unit in the categorization system 20 or the device outside the categorization system 20 may evaluate the categorization done by the categorization unit 23.

　　A plurality of the partial video data (or sub-sequences) to be generated may overlap each other regarding time in the first and third example embodiment, because different human motions can be done by overlapping their time.

　　In Fig. 8A, the examples of "raising left arm", "passing object" and "relaxing" of the person P are shown. It goes without saying that, however, examples of human motions are not limited to these; for example, "raising left arm in vicinity of object", "raising right arm", "pointing with forefinger", "make special gesture with hand" or the like can be human motions to be detected.

　　The present disclosure could be applied to applications where the main information of a data frame can be summarized with a certain low number of somehow related points in a 2 or 3-dimensional space that change their position in this space and the image of these points is given at certain time-steps.

　　The present disclosure relates to a categorization system for various purposes, method and program which is able to categorize motion patterns obtained from point data, which is calculated from a sequence of regularly or irregularly sampled movie frames. This technical system is useful for determining the motion patterns of an acting person and categorizing them accordingly. It may be applied to an intention detection system where the correctly categorized and labeled motion subsequences play an important role for further processing, for example, planning assistance for humans. Specifically, it can be used in a variety of situations such as a factory, shopping mall, warehouse, canteen kitchen or construction site. Furthermore, it can be used to analyze human motion in activities related to sport or other activities. It is also applicable to characterize dynamic patterns very generally. However, the application of the disclosure is not necessarily limited to this field.

　　Next, a hard configuration example of the device explained in the above-described plurality of embodiments is explained hereinafter with reference to Fig. 17.

　　Fig. 17 is a block diagram showing a configuration example of the information processing apparatus. As shown in Fig. 17, the information processing apparatus 90 includes a network interface 91, a processor 92, and a memory 93. The network interface 91 can transmit and receive data to and from other devices by wireless communication.

　　The processor 92 performs processes performed by the information processing apparatus 90 explained with reference to the sequence diagrams and the flowcharts in the above-described embodiments by loading software (a computer program) from the memory 93 and executing the loaded software. The processor 92 may be, for example, a microprocessor, an MPU (Micro Processing Unit), or a CPU (Central Processing Unit). The processor 92 may include a plurality of processors.

　　The memory 93 is formed by a combination of a volatile memory and a nonvolatile memory. The memory 93 may include a storage disposed apart from the processor 92. In this case, the processor 92 may access the memory 93 through an I/O interface (not shown).

　　In the example shown in Fig. 17, the memory 93 is used to store a group of software modules. The processor 92 can perform processes performed by the information processing apparatus explained in the above-described embodiments by reading the group of software modules from the memory 93 and executing the read software modules.

　　As explained above with reference to Fig. 17, each of the processors included in the information processing apparatus in the above-described embodiments executes one or a plurality of programs including a group of instructions to cause a computer to perform an algorithm explained above with reference to the drawings.

　　Furthermore, the information processing apparatus 90 may include the network interface. The network interface is used for communication with other network node apparatuses forming a communication system. The network interface may include, for example, a network interface card (NIC) in conformity with IEEE 802.3 series. The information processing apparatus 90 may receive the Input Feature Maps or send the Output Feature Maps using the network interface.

　　In the above-described examples, the program can be stored and provided to a computer using any type of non-transitory computer readable media. Non-transitory computer readable media include any type of tangible storage media. Examples of non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g. magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.). The program may be provided to a computer using any type of transitory computer readable media. Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g. electric wires, and optical fibers) or a wireless communication line.

　　Part of or all the foregoing embodiments can be described as in the following appendixes, but the present disclosure is not limited thereto.
　　(Supplementary Note 1)
　　A categorization apparatus comprising:
　　a generation means for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　a categorization means for categorizing the partial video data generated by the generation means; and
　　a modification means for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.
　　(Supplementary Note 2)
　　The categorization apparatus according to Supplementary Note 1, further comprising:
　　a preprocessor means for reducing information contained in raw data and generating the video data containing information related to the categorization.
　　(Supplementary Note 3)
　　The categorization apparatus according to

Supplementary Note

1 or 2, wherein
　　the modification means evaluates the categorization by using at least one of following indicators: an indicator that describes how well the categorization means categorizes elements that are already known to be of the same category as being part of the same category, an indicator that describes a deviation from the predetermined number of categories for a defined problem, and an indicator that describes how well a system achieves an overall task, while the system includes the categorization apparatus.
　　(Supplementary Note 4)
　　The categorization apparatus according to any one of Supplementary Notes 1 to 3, wherein
　　the categorization means categorizes the partial video data as a kind of human motion.
　　(Supplementary Note 5)
　　The categorization apparatus according to Supplementary Note 4, wherein
　　the generation means computes an intensity signal of the video data to determine the certain time region, while the intensity signal denotes motion of a person.
　　(Supplementary Note 6)
　　The categorization apparatus according to

Supplementary Note

4 or 5, wherein
　　the categorization means assigns the categorized partial video data to a text label.
　　(Supplementary Note 7)
　　The categorization apparatus according to any one of Supplementary Notes 4 to 6, further comprising:
　　an intention detection means for detecting human intention by using the categorized partial video data.
　　(Supplementary Note 8)
　　The categorization apparatus according to Supplementary Note 7, further comprising:
　　a controller for controlling a movement of a machine based on the human intention detected by the intention detection means.
　　(Supplementary Note 9)
　　A control device comprising:
　　a recognition means for recognizing video data containing an operation and, thereby, determining the operation; and
　　a controller for determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.
　　(Supplementary Note 10)
　　The control device according to Supplementary Note 9, further comprising:
　　a categorization means for categorizing the video data and input the categorized video data to the recognition means.
　　(Supplementary Note 11)
　　The control device according to Supplementary Note 10, further comprising:
　　a generation means for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region, wherein the partial video data is recognized by the recognition means; and
　　a modification means for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.
　　(Supplementary Note 12)
　　The control device according to Supplementary Note 11, wherein
　　the modification means evaluates the categorization by using at least one of following indicators: an indicator that describes how well the categorization means categorizes elements that are already known to be of the same category as being part of the same category, an indicator that describes a deviation from the predetermined number of categories for a defined problem, and an indicator that describes how well a system achieves an overall task, while the system includes the categorization apparatus.
　　(Supplementary Note 13)
　　The control device according to any one of Supplementary Notes 10 to 12, wherein
　　the categorization means categorizes the video data as a kind of human motion.
　　(Supplementary Note 14)
　　A categorization method comprising:
　　determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　categorizing the partial video data; and
　　modifying the predetermined algorithm based on evaluation of the categorization.
　　(Supplementary Note 15)
　　　　Control method comprising:
　　recognizing video data containing an operation and, thereby, determining the operation; and
　　determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.
　　(Supplementary Note 16)
　　A non-transitory computer readable medium storing a program for causing a computer to execute:
　　determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　categorizing the partial video data; and
　　modifying the predetermined algorithm based on evaluation of the categorization.
　　(Supplementary Note 17)
　　A non-transitory computer readable medium storing a program for causing a computer to execute:
　　recognizing video data containing an operation and, thereby, determining the operation; and
　　determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.

　　It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present disclosure as shown in the specific embodiments without departing from the spirit or scope of the disclosure as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

10　　categorization apparatus
11　　generation unit
12　　categorization unit
13　　modification unit
14　　control device
15　　recognition unit
16　　controller
20　　categorization system
21　　preprocessor unit
22　　generation unit
23　　categorization unit
24　　mapping unit
25　　modification unit
26　　computation unit
27　　signal-analysis unit
28　　determination unit
29　　subsequence-generation unit
30　　intention detection system
31　　human-object analysis unit
32　　intention detection unit
40　　machine
41　　signal generator
42　　optimizer controller

Claims

　　A categorization apparatus comprising:
　　a generation means for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　a categorization means for categorizing the partial video data generated by the generation means; and
　　a modification means for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.
　　The categorization apparatus according to Claim 1, further comprising:
　　a preprocessor means for reducing information contained in raw data and generating the video data containing information related to the categorization.
　　The categorization apparatus according to Claim 1 or 2, wherein
　　the modification means evaluates the categorization by using at least one of following indicators: an indicator that describes how well the categorization means categorizes elements that are already known to be of the same category as being part of the same category, an indicator that describes a deviation from the predetermined number of categories for a defined problem, and an indicator that describes how well a system achieves an overall task, while the system includes the categorization apparatus.
　　The categorization apparatus according to any one of Claims 1 to 3, wherein
　　the categorization means categorizes the partial video data as a kind of human motion.
　　The categorization apparatus according to Claim 4, wherein
　　the generation means computes an intensity signal of the video data to determine the certain time region, while the intensity signal denotes motion of a person.
　　The categorization apparatus according to Claim 4 or 5, wherein
　　the categorization means assigns the categorized partial video data to a text label.
　　The categorization apparatus according to any one of Claims 4 to 6, further comprising:
　　an intention detection means for detecting human intention by using the categorized partial video data.
　　The categorization apparatus according to Claim 7, further comprising:
　　a controller for controlling a movement of a machine based on the human intention detected by the intention detection means.
　　A control device comprising:
　　a recognition means for recognizing video data containing an operation and, thereby, determining the operation; and
　　a controller for determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.
　　The control device according to Claim 9, further comprising:
　　a categorization means for categorizing the video data and input the categorized video data to the recognition means.
　　The control device according to Claim 10, further comprising:
　　a generation means for determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region, wherein the partial video data is recognized by the recognition means; and
　　a modification means for modifying the predetermined algorithm based on evaluation of the categorization executed by the categorization means.
　　The control device according to Claim 11, wherein
　　the modification means evaluates the categorization by using at least one of following indicators: an indicator that describes how well the categorization means categorizes elements that are already known to be of the same category as being part of the same category, an indicator that describes a deviation from the predetermined number of categories for a defined problem, and an indicator that describes how well a system achieves an overall task, while the system includes the control device.
　　The control device according to any one of Claims 10 to 12, wherein
　　the categorization means categorizes the video data as a kind of human motion.
　　A categorization method comprising:
　　determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　categorizing the partial video data; and
　　modifying the predetermined algorithm based on evaluation of the categorization.
　　A control method comprising:
　　recognizing video data containing an operation and, thereby, determining the operation; and
　　determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.
　　A non-transitory computer readable medium storing a program for causing a computer to execute:
　　determining a certain time region of video data based on predetermined algorithm and generating partial video data from which the video data is extracted in the certain time region;
　　categorizing the partial video data; and
　　modifying the predetermined algorithm based on evaluation of the categorization.
　　A non-transitory computer readable medium storing a program for causing a computer to execute:
　　recognizing video data containing an operation and, thereby, determining the operation; and
　　determining a motion of a machine depending on the determined operation and controlling the machine in accordance with the determined operation.