# ML / NLP Interviews

This blog compiles some points that might be encountered in ML / NLP interviews. Good luck (for myself and to all readers finding jobs).

• Statistics
• Probability
• Causality, Fair ML
• Models
• Neural Network
• Training
• Natural Languages Processing specific
• NLP Basics
• Information theory
• Embedding
• Parsing

## Statistics

• Evaluating a model. TP / FP / TN / FN
• Type 1 (false positive) and type 2 errors (false negative).
• Precision (TP / (TP + FP)) vs. Recall (TP / (TP + FN)).
• Sensitivity / specificity.
• F1 score definitions. Micro / macro / weighted.
• ROC: receiver operating characteristic) curve. True positive rate vs. false positive rate at various threshold settings.
• AUC: area under (the ROC) curve.
• Bias-variance tradeoff. How to decide whether the model is overfitting, looking from learning plot?
• High bias: overfit. Solve with more training.
• High variance: too flexible; low generalizability. Solve with (1) regularization (2) bagging algorithm (3) using less and more simple features.
• LOOCV, k-fold. They are designed to counteract the bias!

## Probability

• Bayes Rule: posterior = likelihood * prior / evidence.
• How to compute evidence? Sum them up / integral / etc.
• How to compute posterior when it is almost impossible to do the above steps? Variational inference (to maximize the variational lower bound) or Monte Carlo sampling (e.g., Metropolis-Hastings algorithm, Gibbs sampling).
• Probability distribution functions (pdf).
• Exponential family distribution functions. Conjugate priors (Beta prior + binomial distribution; Dirichlet prior + multinomial distribution)
• Mutual information $I(X, Y) = \sum_X \sum_Y P(X, Y) log\frac{P(X, Y)}{P(X)P(Y)}$
• Conditional independence $P(X|Y) = P(Y)$
• Bayesian Network, d-separation.
• Variable elimination.

## Causality and Fair Learning

• Randomized control trials.
• Rubin-Neyman causal model.
• Traditional deconfounding approaches: residualization, inverse probability weighting.
• Fair Learning: unfairness metrics (demographics parity, equalized opportunity)
• Direct optimize unfairness: (Zemel et al., 2013)

## Models

• Supervised learning models: SVM, Random Forest, Gaussian Process, LogReg, Adaboost, Naive Bayes.
• Unsupervised: kNN, clustering, domain adaptation, self-supervised embedding.
• Visualization: PCA, T-SNE.
• Regression vs classification.
• Kernel method. Wikipedia
• Decision trees are suitable for nonlinear data Q7.
• Explain boosting, gradient boosting, GBDT.
• CRF is discriminative (measures conditional probability) whereas HMM is generative (measures joint probability).

## Neural Network

• Long Short Term Memory (LSTM)
First there are three gates: forget, information, and output.Then there’s a G gate updating the cell states. For output we have: $h_t = o_t * \eta(C_t)$
• Transformer (Vaswani et al., 2017) Key, Query, Value “gates” per head, multi-head attention, positional encoding.
• Activations. ReLU vs. Sigmoid vs. Tanh.
• Batch normalization (Ioffe and Szegedy, 2015). Purpose is to address internal covariance shift. Need an additional “scale and shift back” step.

## Training

• Cross-entropy loss or L2 loss?
• Dropout. Explain them.
• Adagrad: adjust the gradients by $\frac{1}{diag(\sum grad*grad^T)}$. A previous large gradient corresponds to a smaller step.
• Momentum method: use gradients to change the velocity of the weights changing.
• Nesterov momentum: First jump, then measure gradients, and then make a correction.
• RMSprop: divide the gradient by a running average of its recent magnitude.
• See Hinton’s slides for more explainations.
• Regularization. L1 / L2. L1 (“Least Absolute Shrinkage and Selection Operator”, LASSO) corresponds to a Laplacian prior, with lots of weights around 0, hence encourage sparcity, hence are regarded as having built-in feature selection mechanism. L2 corresponds to a Gaussian prior.
• Handling imbalance classes.
• Handling missing data.

## Open questions

• What’s a favourite paper you recently read?
• What do you plan to do? (short-term / long-term research goals)

## NLP-basics

• Techniques for keyword normalization? Lemmatization and stemming.
• Techniques for string matching? Levenshtein (i.e., edit distance), soundex, and metaphone.
• Regular Expression (examples in Python).
. (single dot) matches any character.
.* matches any character sequences of any length.
[0-9] matches any one digit.
[a-z] matches any one lower-case alphabet.
Brackets need to be escaped. e.g., \(, \], if you want to match them.
\ need to be specially taken care of, in Python.
• N-gram: Continuous N words as a bag.
• Zipf’s law. In a corpus, the frequency of a word is inversely proportional to its rank (by frequency).

## Information theory

• Entropy $H=-\Sigma_i log(p_i)$
• KL divergence
• Jensen-Shannon divergence

## NLP-Feature extractions and embeddings

• TF-IDF of the term = $TF * log \frac{1}{DF}$ where term frequency is the term’s frequency, and document frequency is the freq of documents containing the term.
• Popular word embedding methods? GloVe, Word2Vec, FastText, ELMo.
• GloVe (Pennington et al., 2014): a global log-bilinear regression model trained on word-word co-occurrence counts.
• Word2Vec (Mikolov et al., 2013): Basically a two-layer networks predicting the context of a word with $P=\text{softmax}(w'^T (\sum w_c))$ (skip-gram), or predict the projected word given its contexts with $P=\sum \text{softmax}(w'^T w_c)$(CBOW). The output is written as $w'$ in the above equations. Use the first layer output as word embedding.
• FastText (Facebook AI Research): An optimized implementation of Word2Vec, but incorporated sub-word information (e.g., including word components of lengths from 3 to 5 as)
• ELMo (Peters et al. 2018): Use intermediate layers outputs of Bi-LSTMs in additional to word embeddings as new embeddings.
• BERT (Devlin et al. 2018): Mask 15% of tokens. Predict them based on their contexts using Transformer.
• Topic modeling: LDA.

## NLP-Parsing

• Shift-reduce parser.
• Parsing: dependency vs. constituency parsing.

## NLP-Misc

• Word sense disambiguation example: Lesk algorithm. Compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
• Recommendation system: HITS, PageRank, collaborative filtering.