PRISM: A Rich Class of Parameterized Submodular Information Measures for Targeted Data Subset Selection


With ever-increasing dataset sizes, subset selection techniques are becoming increasingly important. Often, the selected subset must align with a target. For example, a subset of unlabeled points may be sought to augment training data and improve a model’s accuracy on a targeted slice of data. As another example, a summary of a video, text, or image collection may be needed relative to a given query or private set. Motivated by such applications, we present PRISM, a rich class of PaRameterIzed Submodular information Measures. Through novel functions and their parameterizations, PRISM offers a variety of modeling capabilities, such as trading off query relevance, or the strictness of dissimilarity from a private set, against diversity or coverage. This makes it possible to cater to the different characteristics desired of targeted subsets. We show how PRISM can be applied to the two example tasks above and find that, in doing so, PRISM generalizes some of the past work on these tasks, reinforcing its broad utility. Through extensive experiments on real-world datasets, we demonstrate the superiority of PRISM over the state of the art in improving an image classifier’s accuracy and in image-collection summarization.

Submitted to NeurIPS 2021