ACL 2020 ran virtually. Thanks to the format, I was able to listen to many talk sessions that would otherwise have been held in parallel, and I noticed some trends across the ~150 papers I took notes on. This post presents those trends; my notes are attached below.
- Deep learning is at the center
- Dialog gets exponentially more attention
- NLP impacts society
- Linguistics and cognitive psychology need attention
- We need to understand what has happened in NLP
I saw many papers proposing structural improvements to Transformers. A wide range of ML methods is also applied to NLP systems, e.g., VAEs, dropout, consistency losses, regularization, curriculum learning, and analysis methods inspired by information theory.
The number of publications relevant to dialog grew from 20 (2018) and 40-ish (2019) to 80-ish (2020). While the number of papers is growing exponentially, the field is far from agreeing on, e.g., what qualities a good dialog system should have.
Note that similar critiques apply to the evaluation of generated dialog as much as to generated translations. A best paper runner-up (Mathur et al., 2020) discusses the impact of, e.g., outliers and heteroskedasticity on common evaluation metrics for machine translation systems.
That said, there are already many metrics, including expensive human evaluation. Many dialog system papers evaluate on these existing metrics.
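To make the outlier issue mentioned above concrete, here is a minimal sketch. The scores are made up for illustration and are not taken from Mathur et al. (2020) or any real evaluation campaign. It computes the Pearson correlation between a hypothetical automatic metric and hypothetical human scores, first for five systems of similar quality and then with one very weak system added.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical system-level scores (assumption: invented for this example).
human = [70.0, 71.0, 72.0, 73.0, 74.0]
metric = [30.0, 28.0, 31.0, 29.0, 30.5]

# Correlation among systems of comparable quality.
r_without = pearson(human, metric)

# Add a single, much weaker system that both humans and the metric score low.
r_with = pearson(human + [20.0], metric + [5.0])

print(f"without outlier: {r_without:.2f}")
print(f"with outlier:    {r_with:.2f}")
```

With these invented numbers, the correlation among the five comparable systems is weak (about 0.26), while adding the one outlier pushes it to about 0.99. The metric did not become any better at ranking the strong systems; the outlier dominates the statistic.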
Making society a better place is one of the more important goals in areas with wide industrial applications (e.g., sentiment classification, text mining, and question answering). Researchers should never forget the impact of their work on society.
Abusive language / hate speech
This is a direction where NLP researchers can directly improve society. For example, a corpus collected from social media and online forums, if not carefully preprocessed, could become the source of hate speech in our next chatbot. There are papers that try to detect implicit hate speech by, e.g., reasoning and fine-grained classification. We could even let bots moderate online environments by, e.g., generating counter-narratives against online hate speech.
Fairness and bias in NLP systems
A gender gap exists in NLP research, although it is much smaller than the average across AI fields. An increasing number of papers discuss bias, but the conceptualizations should be analyzed more carefully. For example, the motivations and the quantitative techniques should be well matched.
There is still a long way to go in natural language understanding. Blindly training on datasets without attending to semantics could be compared to trying to learn Java by letting a model study all the Java code on GitHub, without ever telling the model the semantics of Java (Bender and Koller, 2020). To understand meaning, construal may be the key.
Directly solving the “natural language understanding / modeling” problem is like trying to solve a holy-grail problem in a single step. Aside from numerous works on the syntactic (e.g., this, this, this, this, this, and this paper) and semantic (example) aspects of neural models, there are works connecting the dots from linguistics to NLP. Examples include code switching, adjective ordering, compositionality, and declension classes.
This paper and this talk presented an argument about the “bottom-up” vs. the “top-down” development of theories. A bottom-up development (e.g., improving some metrics by a few points) is intrinsically unfalsifiable, whereas theories presented in a top-down fashion, although they could be falsified in the future, are also important. The argument was made in the context of semantics, but I think similar arguments apply to other areas.
A theme track was introduced at ACL ’20. Revisiting the development of NLP over the previous years helps us understand what has happened in the field and anticipate future directions.
In terms of publication citations, there are significant differences between top-tier and other conferences. It is also useful to estimate citation counts. We might forget to cite older papers, as there are exponentially more recent papers.
A question more direct than citation statistics is the evaluation of the NLP methods themselves. The following two papers received outstanding paper (or best paper) awards, but there is more discussion in many other papers.
The first paper (Linzen, 2020) critiques the “pretraining-agnostic identically distributed” (PAID) evaluation paradigm. Evaluation should reward models with good generalization properties, not only those with high performance.
The second paper (Ribeiro et al., 2020) proposes CheckList, a multidimensional checklist for evaluating the abilities of models. CheckList takes the form of a table, with each row representing a linguistic capability (e.g., POS, logic, robustness, temporal, negation, coreference, …, as applicable to the task) and three columns representing three test types: minimum functionality tests, invariance tests, and directional expectation tests. The latter two are built from perturbations of the input. The authors provide a toolkit for testing all these aspects, and show that many SOTA models, as well as commercial, stress-tested models, fail on some entries in this checklist.
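To give a feel for two of these test types, here is a minimal, self-contained sketch. It does not use the authors' toolkit or its API. The keyword-based `toy_sentiment` model, the helper names, and the example sentences are all hypothetical and only meant to illustrate the idea of a minimum functionality test and an invariance test.

```python
def toy_sentiment(text):
    """Hypothetical keyword model: 'pos' if 'good' or 'great' appears, else 'neg'."""
    words = text.lower().replace(".", "").split()
    return "pos" if any(w in ("good", "great") for w in words) else "neg"

def mft_failures(model, labeled_examples):
    """Minimum functionality test: simple cases with a known expected label."""
    return [text for text, label in labeled_examples if model(text) != label]

def invariance_failures(model, texts, perturb):
    """Invariance test: a label-preserving perturbation should not flip the prediction."""
    return [text for text in texts if model(text) != model(perturb(text))]

def swap_name(text):
    """Label-preserving perturbation (assumption): replace one person name with another."""
    return text.replace("Alice", "Bob")

# Capability "negation", test type "minimum functionality".
mft_fails = mft_failures(
    toy_sentiment,
    [("The food was good.", "pos"), ("The food was not good.", "neg")],
)

# Capability "robustness to names", test type "invariance".
inv_fails = invariance_failures(
    toy_sentiment,
    ["Alice had a great flight.", "Alice hated the food."],
    swap_name,
)

print("MFT failures:", mft_fails)
print("INV failures:", inv_fails)
```

The toy model passes the name-swap invariance test but fails the negation case, since it only looks for positive keywords. A single accuracy number would hide this kind of capability-specific failure, which is why the table layout is useful.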