Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

  • We introduce a novel framework for learning task graphs from action sequences, which relies on maximum likelihood estimation to provide a differentiable loss function.
  • We propose two approaches to task graph learning, based on direct optimization of the adjacency matrix and on processing text or video embeddings of key-steps.
  • We showcase the usefulness of task graphs in general and, in particular, on the downstream task of online mistake detection.


Task Graph Maximum Likelihood

Our main contribution is a novel framework for learning task graphs from action sequences, which relies on maximum likelihood estimation to provide a differentiable loss function that can be included in end-to-end models and optimized with gradient descent:

\begin{align} \mathcal{L}(\mathcal{Y},Z) = - \sum_{k = 1}^{|\mathcal{Y}|} \sum_{t = 1}^{|y^{(k)}|} \big(\textcolor{cyan}{\log{\sum_{j \in \mathcal{O}(y^{(k)},t)} Z_{y^{(k)}_t j}}} - \beta \cdot \textcolor{teal}{\log{\sum_{\substack{h \in \overline{\mathcal{O}(y^{(k)},t)} \\ j \in \mathcal{O}(y^{(k)},t)}} Z_{hj}}} \big) \label{eq:loss} \end{align}

Here \(\mathcal{Y}\) is the set of training sequences, \(y^{(k)}\) the \(k\)-th sequence, \(\mathcal{O}(y^{(k)},t)\) the set of key-steps observed before step \(t\) (with \(\overline{\mathcal{O}(y^{(k)},t)}\) its complement), and \(Z\) the weighted adjacency matrix. The first term encourages large weights on edges from the current key-step \(y^{(k)}_t\) to previously observed key-steps (its candidate preconditions), while the second term, weighted by the hyperparameter \(\beta\), penalizes edges reaching the observed key-steps from key-steps not yet observed.
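As a concrete illustration, below is a minimal sketch of how this loss could be computed, assuming PyTorch; the function name `tgml_loss`, the default value of `beta`, and the input conventions are illustrative assumptions, not the authors' released API.

```python
# Minimal TGML loss sketch (Eq. 1), assuming PyTorch.
import torch

def tgml_loss(Z: torch.Tensor, sequences, beta: float = 0.5, eps: float = 1e-9):
    """Z: (n+2, n+2) row-normalized adjacency matrix; sequences: lists of
    integer key-step indices that already include the S and E placeholders."""
    n = Z.shape[0]
    loss = Z.new_zeros(())
    for y in sequences:
        for t in range(1, len(y)):
            observed = torch.tensor(sorted(set(y[:t])))    # O(y, t)
            mask = torch.ones(n, dtype=torch.bool)
            mask[observed] = False
            unobserved = mask.nonzero().squeeze(1)         # complement of O(y, t)
            pos = Z[y[t], observed].sum()                  # edges K_{y_t} -> observed
            neg = Z[unobserved][:, observed].sum()         # edges unobserved -> observed
            loss = loss - (torch.log(pos + eps) - beta * torch.log(neg + eps))
    return loss
```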


Direct Optimization (DO)

The first model directly optimizes the parameters of the adjacency matrix by performing gradient descent on the TGML loss (Eq. (1)). We define the parameters of this model as an edge scoring matrix \(A \in \mathbb{R}^{(n+2) \times (n+2)}\), where \(n\) is the number of key-steps and the two extra rows and columns account for the placeholder start (\(S\)) and end (\(E\)) nodes, and \(A_{ij}\) is a score assigned to edge \(K_i \rightarrow K_j\). To prevent the model from learning edge weights that violate the assumptions of directed acyclic graphs, we mask the cells shown in black in the figure below with \(-\infty\).

Figure: the Direct Optimization (DO) model.

To constrain the elements of \(Z\) to the \([0,1]\) range and obtain normalized weights, we softmax-normalize the rows of the scoring matrix to obtain the adjacency matrix \(Z = \mathrm{softmax}(A)\). Note that elements masked with \(-\infty\) are automatically mapped to \(0\) by the softmax function, similarly to Vaswani et al. We train this model by performing batch gradient descent directly on the score matrix \(A\) with the proposed TGML loss. We train a separate model per procedure, as each procedure is associated with a different task graph. As many applications require an unweighted graph, we binarize the adjacency matrix with the threshold \(\frac{1}{n}\), where \(n\) is the number of nodes. We also employ a post-processing stage in which we remove redundant edges and loops, and add obvious missing connections to the \(S\) and \(E\) nodes.
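A minimal training-loop sketch under these definitions, reusing the `tgml_loss` sketch above, is shown next; the toy `forbidden` mask and `train_sequences` are assumptions standing in for the real DAG constraints and training data.

```python
import torch
import torch.nn.functional as F

# Toy setup: n key-steps plus the S and E placeholder nodes (indices 0 and n+1).
n = 4
num_nodes = n + 2
# Disallowed edges (the black cells in the figure): as a minimal assumption we
# only forbid self-loops here; the real mask encodes all DAG constraints.
forbidden = torch.eye(num_nodes, dtype=torch.bool)
# Toy training sequences: S (0), key-steps (1..n), E (n+1).
train_sequences = [[0, 1, 2, 3, 4, 5], [0, 2, 1, 3, 4, 5]]

A = torch.zeros(num_nodes, num_nodes, requires_grad=True)  # edge score matrix
opt = torch.optim.Adam([A], lr=0.1)

def adjacency(A, forbidden):
    # Softmax-normalize rows; -inf-masked cells are mapped to exactly 0.
    return F.softmax(A.masked_fill(forbidden, float("-inf")), dim=1)

for _ in range(500):
    opt.zero_grad()
    loss = tgml_loss(adjacency(A, forbidden), train_sequences, beta=0.5)
    loss.backward()
    opt.step()

# Binarize with threshold 1/n (n = number of nodes); redundant-edge removal
# and S/E fix-ups are then applied as post-processing.
Z_bin = (adjacency(A, forbidden) > 1.0 / num_nodes).float()
```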


Task Graph Transformer (TGT)

The figure below illustrates the second proposed model, termed Task Graph Transformer (TGT). The model can take as input either \(D\)-dimensional embeddings of textual descriptions of key-steps or \(D\)-dimensional video embeddings of key-step segments extracted from video. In the first case, the model takes as input the same set of embeddings at each forward pass, while in the second case, at each forward pass, we randomly sample a video embedding per key-step from the training videos (hence each key-step embedding can be sampled from a different video). We also include two \(D\)-dimensional learnable embeddings for the \(S\) and \(E\) nodes.

Figure: the Task Graph Transformer (TGT) model.

All key-step embeddings are processed by a transformer encoder, which outputs \(D\)-dimensional vectors enriched with information from the other embeddings. To prevent representation collapse, we apply a regularization loss encouraging distinctiveness between pairs of different nodes. Let \(X\) be the matrix of embeddings produced by the transformer encoder. We L2-normalize the features, then compute pairwise cosine similarities \(Y = X \cdot X^T \cdot \exp(T)\) as in Radford et al. To prevent the transformer encoder from mapping distinct key-step embeddings to similar representations, we enforce the values outside the diagonal of \(Y\) to be smaller than the values on the diagonal, encouraging each row of \(Y\) to be close to a one-hot vector with a cross-entropy loss.

The regularized embeddings are then passed to a relation transformer head, which considers all possible pairs of embeddings and concatenates them into an \((n+2) \times (n+2) \times 2D\) matrix \(R\) of relation vectors; for instance, \(R[i,j]\) is the concatenation of vectors \(X[i]\) and \(X[j]\). Relation vectors are passed to a transformer layer which mines relationships among them, followed by a multilayer perceptron that reduces the dimensionality to \(16\) units, and another transformer layer and multilayer perceptron that map relation vectors to scalar values, which are reshaped to size \((n+2) \times (n+2)\) to form the score matrix \(A\). We then apply the same optimization procedure as in the DO method to supervise the whole architecture.
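Below is a hedged architectural sketch of TGT, assuming PyTorch; the class name `TGTSketch`, the layer and head counts, and the default embedding size are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGTSketch(nn.Module):
    def __init__(self, d: int = 768):  # d = D, the key-step embedding size (assumed)
        super().__init__()
        self.se = nn.Parameter(torch.randn(2, d))        # learnable S and E embeddings
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.temp = nn.Parameter(torch.zeros(()))        # log-temperature T
        self.rel_tf1 = nn.TransformerEncoderLayer(2 * d, nhead=8, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(2 * d, 16), nn.ReLU())
        self.rel_tf2 = nn.TransformerEncoderLayer(16, nhead=4, batch_first=True)
        self.mlp2 = nn.Linear(16, 1)

    def forward(self, keystep_emb):                      # (n, d) text or video embeddings
        x = torch.cat([self.se, keystep_emb], dim=0)     # add S and E -> (n+2, d)
        x = self.encoder(x.unsqueeze(0)).squeeze(0)      # contextualized embeddings X
        xn = F.normalize(x, dim=-1)                      # L2-normalize
        sim = xn @ xn.T * torch.exp(self.temp)           # Y = X X^T exp(T)
        # Distinctiveness regularizer: push each row of Y toward a one-hot vector.
        reg = F.cross_entropy(sim, torch.arange(sim.shape[0], device=sim.device))
        m = x.shape[0]
        rel = torch.cat([x.unsqueeze(1).expand(m, m, -1),    # R[i, j] = [X_i; X_j]
                         x.unsqueeze(0).expand(m, m, -1)], dim=-1)
        rel = self.mlp1(self.rel_tf1(rel.reshape(1, m * m, -1)))
        scores = self.mlp2(self.rel_tf2(rel))            # one scalar per node pair
        return scores.reshape(m, m), reg                 # score matrix A + reg loss
```

The returned score matrix would then be masked, softmax-normalized, and supervised with the TGML loss exactly as in the DO sketch above, with the distinctiveness term `reg` added to the total loss.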


Download

  • Data (Annotations): GitHub
  • TGL (Task Graph Learning): GitHub
  • Mistake (Online Mistake Detection): Coming Soon...
  • Features: Coming Soon...

Paper

Seminara, Luigi, Giovanni Maria Farinella, and Antonino Furnari. "Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos." arXiv preprint arXiv:2406.01486 (2024).

[September 2024] Accepted at Neural Information Processing Systems (NeurIPS) 2024 as a spotlight.

Cite our paper:

@article{seminara2024differentiable,
    title={Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos},
    author={Seminara, Luigi and Farinella, Giovanni Maria and Furnari, Antonino},
    journal={arXiv preprint arXiv:2406.01486},
    year={2024}
}

Visit our page dedicated to First Person Vision Research for other related publications.


People

  • Luigi Seminara, FPV@IPLAB
  • Giovanni Maria Farinella, FPV@IPLAB
  • Antonino Furnari, FPV@IPLAB