In this paper, we estimate the task-specific information in text-based classification datasets.
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but this high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues are “shortcuts” inherent in the datasets and do not contribute to the task-specific information (TSI) of the classification tasks. While it is essential to look at model performance, it is also important to understand the datasets. In this paper, we consider the following question: apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. Although this quantity is hard to compute exactly, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge, modulo a set of predefined shortcuts, that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets: for example, apart from a set of “shortcut features”, classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pair task.
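To give a feel for what a quantity measured in nats could look like in practice, here is a minimal sketch. It is an illustration under assumptions, not the estimator from the paper: it assumes that a gap between two cross-entropy losses (one from a hypothetical classifier that sees only shortcut features, one from a hypothetical classifier that sees the full input) can stand in for the information contributed by the non-shortcut part. The function names and the toy probabilities are made up for this example.

```python
import math

def cross_entropy_nats(probs, labels):
    """Mean negative log-likelihood of the true labels, in nats.

    probs:  list of per-class probability lists, one per sample
    labels: list of integer class indices, one per sample
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def tsi_estimate(shortcut_probs, full_probs, labels):
    """Illustrative gap (ASSUMPTION, not the paper's method):
    loss of a shortcut-only classifier minus loss of a full-input classifier."""
    return (cross_entropy_nats(shortcut_probs, labels)
            - cross_entropy_nats(full_probs, labels))

# Toy, hand-written predictions for four samples of a binary task.
labels = [0, 1, 1, 0]
shortcut_probs = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.5, 0.5]]
full_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.8, 0.2]]

estimate = tsi_estimate(shortcut_probs, full_probs, labels)
print(f"illustrative estimate: {estimate:.3f} nats per sample")
```

Because natural logarithms are used, the result is in nats per sample, which is the unit used in the Multi-NLI versus Quora Question Pair comparison above. A larger gap would suggest that more of the label information lies outside the chosen shortcut features, under the assumptions stated here.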
Here is a slide deck at UT Language Research Day (2021-11-12):
We need to understand the dataset more. Here's an angle by task-specific information (TSI): https://t.co/zQi4H7zeUt, w/ @AparnaBee @MarzyehGhassemi @SPOClab On average, how much information does the non-shortcut portion of a dataset contribute?— Zining Zhu (@zhuzining) October 20, 2021