Correspondence Analysis
Definition
Correspondence analysis is a dimension-reduction method for contingency tables.
It is similar in spirit to PCA, but it uses chi-square geometry instead of ordinary Euclidean geometry on raw numeric variables.
Contingency Table
Let $N$ be a nonnegative table with entries $n_{ij}$, where $i$ indexes rows and $j$ indexes columns.
In basket analysis, a useful table is:
- rows = products
- columns = baskets
- cell $n_{ij} = 1$ if product $i$ appears in basket $j$, and $0$ otherwise
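The product-by-basket table described above can be built from raw basket data. The following is a minimal sketch, assuming baskets arrive as sets of product names; the toy data and the variable names `baskets`, `products`, and `N` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: each basket is a set of product names.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "eggs"},
    {"bread", "milk", "eggs"},
]

# Row labels: every distinct product, in sorted order.
products = sorted(set().union(*baskets))

# N[i, j] = 1 if product i appears in basket j, and 0 otherwise.
N = np.array([[1 if p in b else 0 for b in baskets] for p in products])
```

Rows follow the order of `products` and columns follow the order of `baskets`.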
Masses
Let:
\[n = \sum_i \sum_j n_{ij}\]
The correspondence matrix is:
\[P = \frac{N}{n}\]
Row masses:
\[r_i = \sum_j p_{ij}\]
Column masses:
\[c_j = \sum_i p_{ij}\]
Profiles
A row profile converts a row into conditional proportions:
\[\frac{p_{ij}}{r_i}\]
It describes the distribution of a row across columns.
For products, a row profile describes the basket pattern of a product.
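The masses and row profiles defined above can be computed in a few lines. This is a sketch on a hypothetical 4 x 4 product-by-basket table, assuming every row has a nonzero total so that the division by $r_i$ is defined.

```python
import numpy as np

# Hypothetical product-by-basket indicator table (rows = products).
N = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
], dtype=float)

n = N.sum()          # grand total
P = N / n            # correspondence matrix
r = P.sum(axis=1)    # row masses r_i
c = P.sum(axis=0)    # column masses c_j

# Row profiles: divide each row of P by its mass, so each row sums to 1.
row_profiles = P / r[:, None]
```

Both mass vectors sum to 1, and each row of `row_profiles` is a distribution over baskets.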
Chi-Square Geometry
Correspondence analysis compares profiles with chi-square distance.
For two row profiles $i$ and $i'$:
\[d^2(i,i') = \sum_j \frac{1}{c_j}\left(\frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}}\right)^2\]
Differences on rare columns get more weight than differences on common columns.
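The distance formula above translates directly into code. The helper name `chi2_distance_sq` is hypothetical, and the sketch assumes all row and column masses are strictly positive.

```python
import numpy as np

def chi2_distance_sq(P, i, k):
    """Squared chi-square distance between the profiles of rows i and k of P."""
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    diff = P[i] / r[i] - P[k] / r[k]    # difference of the two row profiles
    return float(np.sum(diff ** 2 / c)) # each term weighted by 1 / c_j

# Hypothetical product-by-basket indicator table.
N = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
], dtype=float)
P = N / N.sum()

d2 = chi2_distance_sq(P, 0, 1)
```

Because each squared difference is divided by $c_j$, the same gap contributes more when it falls on a column with a small mass.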
Standardized Residual Matrix
Correspondence analysis decomposes:
\[S = D_r^{-1/2}(P - rc^T)D_c^{-1/2}\]
where:
- $D_r$ is the diagonal matrix of row masses
- $D_c$ is the diagonal matrix of column masses
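- $rc^T$ is the independence model: the proportions $r_i c_j$ expected if rows and columns were independent, with $r$ and $c$ the vectors of row and column masses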
Then it applies SVD to $S$.
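The standardized residual matrix and its SVD can be sketched with NumPy. The example uses the elementwise form $(p_{ij} - r_i c_j)/\sqrt{r_i c_j}$, which equals the matrix expression above because $D_r$ and $D_c$ are diagonal. The table is hypothetical, and all masses are assumed to be strictly positive.

```python
import numpy as np

# Hypothetical product-by-basket indicator table.
N = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Independence model r c^T, then standardized residuals.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Singular values come back sorted in decreasing order.
U, s, Vt = np.linalg.svd(S, full_matrices=False)
```

The smallest singular value is zero up to rounding, since subtracting the independence model removes one trivial dimension from $S$.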
Inertia
Inertia is the correspondence-analysis version of variance.
Each dimension has inertia:
\[\lambda_k = s_k^2\]
where $s_k$ is the $k$-th singular value of $S$.
Explained inertia is:
\[\frac{\lambda_k}{\sum_m \lambda_m}\]
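Inertia and explained inertia follow from the singular values of $S$. This is a minimal sketch on a hypothetical table, assuming the total inertia is nonzero (the table is not exactly independent), so the ratio is defined.

```python
import numpy as np

# Hypothetical product-by-basket indicator table.
N = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Only the singular values are needed here.
s = np.linalg.svd(S, compute_uv=False)

inertia = s ** 2                      # lambda_k = s_k^2
explained = inertia / inertia.sum()   # share of total inertia per dimension
```

The total `inertia.sum()` equals the sum of squared entries of $S$, which is the chi-square statistic of the table divided by $n$.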