ML / NLP interviews

This blog compiles some points that might be encountered in ML / NLP interviews. Good luck (for myself and to all readers finding jobs).

  • Statistics
  • Probability
  • Causality, Fair ML
  • Models
  • Neural Network
  • Training
  • Natural Languages Processing specific
    • NLP Basics
    • Information theory
    • Embedding
    • Parsing


  • Evaluating a model. TP / FP / TN / FN
  • Type 1 (false positive) and type 2 errors (false negative).
  • Precision (TP / (TP + FP)) vs. Recall (TP / (TP + FN)).
  • Sensitivity / specificity.
  • F1 score definitions. Micro / macro / weighted.
  • ROC: receiver operating characteristic) curve. True positive rate vs. false positive rate at various threshold settings.
  • AUC: area under (the ROC) curve.
  • Bias-variance tradeoff. How to decide whether the model is overfitting, looking from learning plot?
    • High bias: overfit. Solve with more training.
    • High variance: too flexible; low generalizability. Solve with (1) regularization (2) bagging algorithm (3) using less and more simple features.
  • LOOCV, k-fold. They are designed to counteract the bias!


  • Bayes Rule: posterior = likelihood * prior / evidence.
  • How to compute evidence? Sum them up / integral / etc.
  • How to compute posterior when it is almost impossible to do the above steps? Variational inference (to maximize the variational lower bound) or Monte Carlo sampling (e.g., Metropolis-Hastings algorithm, Gibbs sampling).
  • Probability distribution functions (pdf).
  • Exponential family distribution functions. Conjugate priors (Beta prior + binomial distribution; Dirichlet prior + multinomial distribution)
  • Mutual information
  • Conditional independence $P(X|Y) = P(Y)$
  • Bayesian Network, d-separation.
  • Variable elimination.

Causality and Fair Learning

  • Randomized control trials.
  • Rubin-Neyman causal model.
  • Traditional deconfounding approaches: residualization, inverse probability weighting.
  • Fair Learning: unfairness metrics (demographics parity, equalized opportunity)
  • Direct optimize unfairness: (Zemel et al., 2013)
  • Adversarial components: (Madras et al., 2018)


  • Supervised learning models: SVM, Random Forest, Gaussian Process, LogReg, Adaboost, Naive Bayes.
  • Unsupervised: kNN, clustering, domain adaptation, self-supervised embedding.
  • Visualization: PCA, T-SNE.
  • Regression vs classification.
  • Kernel method. Wikipedia
  • Decision trees are suitable for nonlinear data Q7.
  • Explain boosting, gradient boosting, GBDT.
  • CRF is discriminative (measures conditional probability) whereas HMM is generative (measures joint probability).

Neural Network

  • Long Short Term Memory (LSTM)
    First there are three gates: forget, information, and output.Then there’s a G gate updating the cell states. For output we have:
  • Transformer (Vaswani et al., 2017) Key, Query, Value “gates” per head, multi-head attention, positional encoding.
  • Activations. ReLU vs. Sigmoid vs. Tanh.
  • Gradient explosion / gradient vanishing. How to address them? (Use ReLU; LSTM gating + gradient clipping etc.)
  • Batch normalization (Ioffe and Szegedy, 2015). Purpose is to address internal covariance shift. Need an additional “scale and shift back” step.


  • Cross-entropy loss or L2 loss?
  • Dropout. Explain them.
  • Gradient optimizer: GD, SGD.
  • Adagrad: adjust the gradients by . A previous large gradient corresponds to a smaller step.
  • Momentum method: use gradients to change the velocity of the weights changing.
    • Nesterov momentum: First jump, then measure gradients, and then make a correction.
    • RMSprop: divide the gradient by a running average of its recent magnitude.
    • Adam optimizer: a combination of Adagrad and RMSprop.
    • See Hinton’s slides for more explainations.
  • Regularization. L1 / L2. L1 (“Least Absolute Shrinkage and Selection Operator”, LASSO) corresponds to a Laplacian prior, with lots of weights around 0, hence encourage sparcity, hence are regarded as having built-in feature selection mechanism. L2 corresponds to a Gaussian prior.
  • Handling imbalance classes.
  • Handling missing data.

Open questions

  • What’s a favourite paper you recently read?
  • What do you plan to do? (short-term / long-term research goals)


  • Techniques for keyword normalization? Lemmatization and stemming.
  • Techniques for string matching? Levenshtein (i.e., edit distance), soundex, and metaphone.
  • Regular Expression (examples in Python).
    . (single dot) matches any character.
    .* matches any character sequences of any length.
    [0-9] matches any one digit.
    [a-z] matches any one lower-case alphabet.
    Brackets need to be escaped. e.g., \(, \], if you want to match them.
    \ need to be specially taken care of, in Python.
  • N-gram: Continuous N words as a bag.
  • Zipf’s law. In a corpus, the frequency of a word is inversely proportional to its rank (by frequency).

Information theory

  • Entropy
  • KL divergence
  • Jensen-Shannon divergence

NLP-Feature extractions and embeddings

  • TF-IDF of the term = where term frequency is the term’s frequency, and document frequency is the freq of documents containing the term.
  • Popular word embedding methods? GloVe, Word2Vec, FastText, ELMo.
    • GloVe (Pennington et al., 2014): a global log-bilinear regression model trained on word-word co-occurrence counts.
    • Word2Vec (Mikolov et al., 2013): Basically a two-layer networks predicting the context of a word with (skip-gram), or predict the projected word given its contexts with (CBOW). The output is written as in the above equations. Use the first layer output as word embedding.
    • FastText (Facebook AI Research): An optimized implementation of Word2Vec, but incorporated sub-word information (e.g., including word components of lengths from 3 to 5 as)
    • ELMo (Peters et al. 2018): Use intermediate layers outputs of Bi-LSTMs in additional to word embeddings as new embeddings.
    • BERT (Devlin et al. 2018): Mask 15% of tokens. Predict them based on their contexts using Transformer.
  • Topic modeling: LDA.


  • Shift-reduce parser.
  • Parsing: dependency vs. constituency parsing.


  • Word sense disambiguation example: Lesk algorithm. Compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
  • Recommendation system: HITS, PageRank, collaborative filtering.