A graph matching method based on Integer Linear Programming

This notebook is a follow-up to the notebook called "a graph matching method based on the leading Eigen vector and Sinkhorn-Knopp algorithm".

This notebook shows how to perform graph matching thanks to Integer Linear Programming.

It is based on several papers

Authors : Romain Raveaux (romain.raveaux@univ-tours.fr), Puqing Chen (c.puqing@gmail.com)

Web site : http://romain.raveaux.free.fr

The graph matching problem

The following text is better displayed and more accurate in the PDF file : http://romain.raveaux.free.fr/document/AroundGraphMatchingRelaxedAndRefined.pdf

The objective of graph matching is to find correspondences between two attributed graphs $G_1$ and $G_2$. A solution of graph matching is defined as a subset of possible correspondences $\mathcal{Y} \subseteq V_1 \times V_2$, represented by a binary assignment matrix $Y \in \{0,1 \}^{n1 \times n2}$, where $n1$ and $n2$ denote the number of nodes in $G_1$ and $G_2$, respectively. If $u_i \in V_1$ matches $v_k \in V_2$, then $Y_{i,k}=1$, and $Y_{i,k}=0$ otherwise. We denote by $y\in \{0,1 \}^{n1 . n2}$, a column-wise vectorized replica of $Y$. With this notation, graph matching problems can be expressed as finding the assignment vector $y^*$ that maximizes a score function $S(G_1, G_2, y)$ as follows:

\begin{align} y^* =& \underset{y} {\mathrm{argmax}} \quad S(G_1, G_2, y)\\ \text{subject to}\quad & y \in \{0, 1\}^{n1.n2} \\ &\sum_{i=1}^{n1} y_{i,k} \leq 1 \quad \forall k \in [1, \cdots, n2]\\ &\sum_{k=1}^{n2} y_{i,k} \leq 1 \quad \forall i \in [1, \cdots, n1] \end{align}

where the last two constraints induce the matching conditions (each node is assigned at most once), thus making $y$ an assignment vector.

The function $S(G_1, G_2, y)$ measures the similarity of graph attributes, and is typically decomposed into a first order dissimilarity function $s(u_i,v_k)$ for a node pair $u_{i} \in V_1$ and $v_k \in V_2$, and a second-order similarity function $s(e_{ij},e_{kl})$ for an edge pair $e_{ij} \in E_1$ and $e_{kl} \in E_2$. Thus, the objective function of graph matching is defined as: \begin{equation} \begin{aligned} S(G_1,G_2,y) =& \sum_{i=1}^{n1} \sum_{k=1}^{n2} s(u_i,v_k) \cdot y_{i,k} + \sum_{i=1}^{n1} \sum_{j=1}^{n1} \sum_{k=1}^{n2} \sum_{l=1}^{n2} s(e_{ij},e_{kl})\cdot y_{ik} \cdot y_{jl}\\ =& \sum_{i=1}^{n1} \sum_{j=1}^{n1} \sum_{k=1}^{n2} \sum_{l=1}^{n2} K_{ik,jl}\cdot y_{ik} \cdot y_{jl}\\ \quad \quad \quad \quad =&y^TKy \end{aligned} \end{equation}

Similarity functions are usually represented by a symmetric similarity matrix $K \in \mathbb{R}^{n1n2 \times n1n2}$. A non-diagonal element $K_{ik,jl} = s(e_{ij},e_{kl})$ contains the edge similarity and a diagonal term $K_{ik,ik} = s(u_i,v_k)$ represents the vertex similarity. $K$ is called an affinity matrix or compatibility matrix. In this representation, $K_{ik,jl}=0$ means an impossible matching or a very dissimilar matching. In essence, the score accumulates all the similarity values that are relevant to the assignment. The formulation above is an integer quadratic program. It expresses the quadratic assignment problem, which is known to be NP-hard.
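As a hedged illustration of this construction, the sketch below shows one way such an affinity matrix $K$ could be assembled from node and edge similarity functions. The names `build_affinity_matrix`, `node_sim` and `edge_sim` are placeholders for this example, not functions from the notebook.

```python
import numpy as np

def build_affinity_matrix(A1, A2, node_sim, edge_sim):
    """Assemble the affinity matrix K for two graphs given by adjacency
    matrices A1 (n1 x n1) and A2 (n2 x n2).

    node_sim(i, k)        -> similarity between node i of G1 and node k of G2
    edge_sim(i, j, k, l)  -> similarity between edge (i, j) of G1 and edge (k, l) of G2
    """
    n1, n2 = A1.shape[0], A2.shape[0]
    K = np.zeros((n1 * n2, n1 * n2))
    # Column-wise vectorization: the pair (i, k) is mapped to index i + k * n1.
    idx = lambda i, k: i + k * n1
    for i in range(n1):
        for k in range(n2):
            # Diagonal terms: node-to-node similarities.
            K[idx(i, k), idx(i, k)] = node_sim(i, k)
            for j in range(n1):
                for l in range(n2):
                    # Off-diagonal terms: edge-to-edge similarities,
                    # only when both edges actually exist.
                    if i != j and k != l and A1[i, j] and A2[k, l]:
                        K[idx(i, k), idx(j, l)] = edge_sim(i, j, k, l)
    return K
```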

Relaxed graph matching problem

The NP-hard graph matching problem is relaxed by dropping both the binary and the mapping constraints. The model to be solved is then:

\begin{equation} \begin{aligned} y^* =& \underset{y} {\mathrm{argmax}} \quad y^TKy \end{aligned} \end{equation} \begin{align} \text{subject to}\quad & y \in [0,1]^ {\vert V_1 \vert \cdot \vert V_2 \vert }\\ &\Vert y \Vert_2 =1 \end{align}

Solving the relaxed graph matching problem

The optimal $y^*$ is then given by the leading eigenvector of the matrix $K$.

Eigen vectors

First, we need to find the eigenvalues $\lambda$, the roots of the characteristic polynomial: \begin{equation} P(\lambda)=\vert K - \lambda I \vert \end{equation} Setting the polynomial equal to zero, $P(\lambda)=0$, gives the roots $\lambda_i \quad \forall i \in [1, \cdots, n1.n2]$; the $\lambda_i$ are the eigenvalues of $K$. The eigenvector $m_i$ corresponding to each eigenvalue can be found by solving for its components in the equation: \begin{equation} (K - \lambda_i I)m_i=0 \quad \forall i \in [1, \cdots, n1.n2] \end{equation} The eigenvector associated with the largest eigenvalue is the solution of the relaxed problem above. \begin{equation} k^*= arg \max_{k \in [1, \cdots, n1.n2 ] } \lambda_k \end{equation}

\begin{equation} y^* = m_{k^*} \end{equation}

Power iteration

Computing the leading eigenvector $y^*$ of the affinity matrix $K$ can be done using power iterations. \begin{equation} m^{(k+1)}=\frac{K m^{(k)}}{\Vert K m^{(k)} \Vert_2} \end{equation} $m^{(0)}$ is initialized to the all-ones vector. $N$ iterations are run and the output is $y^* =m^{(N)}$.
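As an illustration, here is a minimal power-iteration sketch in Python, assuming NumPy and a precomputed affinity matrix `K`. The function name is a placeholder, not taken from the notebook.

```python
import numpy as np

def power_iteration(K, num_iters=100):
    """Return an approximation of the leading eigenvector of K."""
    m = np.ones(K.shape[0])           # m^(0) = 1
    for _ in range(num_iters):
        m = K @ m                     # multiply by K
        m = m / np.linalg.norm(m)     # renormalize to unit L2 norm
    return m

# y_star = power_iteration(K)                        # relaxed solution
# Y_star = y_star.reshape((n1, n2), order="F")       # column-wise reshape into an n1 x n2 matrix
```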

The time complexity of this algorithm per power iteration is $O(n1^2.n2^2)$ when the matrix K is dense.

Refining the relaxed graph matching problem

The solution $y^*$ of the relaxed graph matching problem above is reshaped to become a matrix $Y^* \in \mathbb{R}^{n1 \times n2}$. Then, the matrix $Y^*$ is modified to become a doubly stochastic matrix. In the graph matching context, this modification represents adding the one-to-one mapping constraints (L1 constraints) $\forall k, \sum_{i} y_{ik} = 1$ and $\forall i, \sum_{k} y_{ik} = 1$. The transformation of the matrix $Y^*$ into a doubly stochastic matrix is performed by the Sinkhorn-Knopp algorithm, starting with $M^{(0)}=Y^*$: $$M^{(k+1)}=\mathrm{diag}(M^{(k)} 1_{n2})^{-1}M^{(k)}$$ $$M^{(k+2)}=M^{(k+1)}\,\mathrm{diag}(1_{n1}^T M^{(k+1)})^{-1}$$ or, written element-wise, $$M^{(k+1)}_{i,j}=\frac{M^{(k)}_{i,j}}{\sum_j M^{(k)}_{i,j}} \quad \forall i \in [1, \cdots, n1]$$ $$M^{(k+2)}_{i,j}=\frac{M^{(k+1)}_{i,j}}{\sum_i M^{(k+1)}_{i,j}} \quad \forall j \in [1, \cdots, n2]$$ with $1_{n1} \in \{1\}^{n1 \times 1}$ and $1_{n2} \in \{1\}^{n2 \times 1}$.
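A minimal Python sketch of this alternating normalization could look as follows (assuming NumPy; the function name is a placeholder):

```python
import numpy as np

def sinkhorn_knopp(Y, num_iters=20, eps=1e-9):
    """Alternately normalize rows and columns of a non-negative matrix Y
    so that it converges towards a doubly stochastic matrix."""
    M = np.array(Y, dtype=float)
    for _ in range(num_iters):
        M = M / (M.sum(axis=1, keepdims=True) + eps)   # row normalization
        M = M / (M.sum(axis=0, keepdims=True) + eps)   # column normalization
    return M
```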

Integer Linear Programming for Graph Matching

An Integer Linear Program (ILP) is a mathematical model whose objective function is a linear combination of the variables, whose constraints are linear combinations of the variables, and whose variables take integer values. Integer Linear Programming aims at solving the "exact" graph matching problem; by exact we mean the problem that is not relaxed.

The entire formulation is called F2 (see the reference for the full details) and is described as follows:

(Image: the F2 ILP formulation of the graph matching problem.)

where the function $c$ is a cost function defined as the opposite of the similarity function ($c=-s$).
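In case the image does not display, here is a hedged sketch of a standard linearization of the quadratic objective, with binary node variables $x_{ik}$ and edge-pair variables $z_{ik,jl}$. It conveys the idea but is not necessarily identical to the F2 formulation given in the reference:

\begin{equation} \begin{aligned} \min_{x,z} \quad & \sum_{i=1}^{n1} \sum_{k=1}^{n2} c(u_i,v_k)\, x_{ik} + \sum_{e_{ij} \in E_1} \sum_{e_{kl} \in E_2} c(e_{ij},e_{kl})\, z_{ik,jl}\\ \text{subject to} \quad & \sum_{k=1}^{n2} x_{ik} \leq 1 \quad \forall i \in [1, \cdots, n1]\\ & \sum_{i=1}^{n1} x_{ik} \leq 1 \quad \forall k \in [1, \cdots, n2]\\ & z_{ik,jl} \leq x_{ik}, \quad z_{ik,jl} \leq x_{jl}, \quad z_{ik,jl} \geq x_{ik} + x_{jl} - 1\\ & x_{ik} \in \{0,1\}, \quad z_{ik,jl} \in \{0,1\} \end{aligned} \end{equation}

The linking constraints force $z_{ik,jl} = x_{ik} \cdot x_{jl}$, so the linear objective reproduces the quadratic score while keeping the model linear.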

Let's code !

  1. Read two graphs
  2. Define the similarity function
  3. Compute the cost function $c$
  4. Implement the Integer Linear Program
  5. Compute the matching by solving the Integer Linear Program

Install requirements
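The exact requirements cell is not reproduced here. A minimal sketch, assuming the notebook relies on NetworkX, NumPy, Matplotlib, PyTorch (for the edge tensors) and an ILP solver such as PuLP (the package list is an assumption), could be:

```python
# Package list is an assumption; adapt to the solver actually used in the notebook.
!pip install networkx numpy matplotlib torch pulp
```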

Read the graphs

  1. Datasets and functions to read data are taken from http://gmprdia.univ-lr.fr/

Download Data

Letter Database

Graphs that represent distorted letter drawings. They consider the 15 capital letters of the Roman alphabet that consist of straight lines only (A, E, F, H, I, K, L, M, N, T, V, W, X, Y, Z). Each node is labeled with a two-dimensional attribute giving its position relative to a reference coordinate system. Edges are unlabeled. The graph database consists of a training set, a validation set, and a test set of size 750 each. Also, three levels of distortions are provided.

This dataset is part of IAM Graph Database Repository and it is also linked in the IAPR TC15 resources.

It can be considered as a TOY EXAMPLE for graph classification.

Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitora Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

Prepare data reader

IAM graphs are provided as GXL files:

<gxl>
  <graph id="GRAPH_ID" edgeids="false" edgemode="undirected">
    <node id="_0">
      <attr name="x">
        <float>0.812867</float>
      </attr>
      <attr name="y">
        <float>0.630453</float>
      </attr>
    </node>
    ...
    <node id="_N">
      ...
    </node>
    <edge from="_0" to="_1"/>
    ...
    <edge from="_M" to="_N"/>
  </graph>
</gxl>

Some imports

Read a graph
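A hedged sketch of such a reader, assuming the GXL layout shown above and NetworkX (the function name `read_letter_gxl` is a placeholder, not the notebook's own reader):

```python
import xml.etree.ElementTree as ET
import networkx as nx

def read_letter_gxl(path):
    """Parse a Letter GXL file into an undirected NetworkX graph
    with (x, y) node attributes."""
    tree = ET.parse(path)
    G = nx.Graph()
    for node in tree.iter("node"):
        # Each node carries two float attributes named "x" and "y".
        attrs = {a.get("name"): float(a.find("float").text) for a in node.iter("attr")}
        G.add_node(node.get("id"), x=attrs["x"], y=attrs["y"])
    for edge in tree.iter("edge"):
        G.add_edge(edge.get("from"), edge.get("to"))
    return G
```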

Load Data with NetworkX

We load two graphs and display them.

Similarity functions

Similarity functions are of primary interest, since they drive the optimization problem. We will present two different kinds of similarities, for pedagogical purposes.

First, we extract the node features and the adjacency matrix.

Node features and adjacency matrix

Node features are the position (x,y) in a 2D space.

Node Similarity : Version 1

Let us use the scalar product as a similarity measure. Features are the node positions (x, y) in a 2D space.
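A minimal sketch of this node similarity, assuming NumPy arrays `F1` (shape n1 x 2) and `F2` (shape n2 x 2) of node positions (the array and function names are placeholders):

```python
import numpy as np

def node_similarity_v1(F1, F2):
    """Pairwise scalar products between node feature vectors."""
    return F1 @ F2.T   # shape (n1, n2): entry (i, k) = <F1[i], F2[k]>
```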

Converting similarities to costs

The first-order similarities are stored in a vector and converted to costs ($c = -s$).

Let's build edge similarities. Here we do not compute similarities but dissimilarities. Edge dissimilarities are the normalized distances between two nodes.

EdgeSimilarities represent a distance, so they are not similarities but costs.

Let us make a function to compute costs : Version 1 (with features)
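As a hedged sketch of such a cost function (version 1), one could negate the scalar products for the node costs and compare normalized edge lengths for the edge costs. The function name, the exact normalization and the edge-cost definition are assumptions; the notebook's own cell may differ.

```python
import numpy as np

def compute_costs_v1(F1, F2, E1, E2):
    """F1, F2: (n, 2) node positions; E1, E2: (|E|, 2) integer node-index pairs.
    Node costs: negated scalar products of positions.
    Edge costs: absolute difference of normalized edge lengths (assumption)."""
    node_costs = -(F1 @ F2.T)                        # (n1, n2)

    def normalized_lengths(F, E):
        d = np.linalg.norm(F[E[:, 0]] - F[E[:, 1]], axis=1)
        return d / (d.max() + 1e-9)                  # normalize to [0, 1]

    l1 = normalized_lengths(F1, E1)                  # (|E1|,)
    l2 = normalized_lengths(F2, E2)                  # (|E2|,)
    edge_costs = np.abs(l1[:, None] - l2[None, :])   # (|E1|, |E2|)
    return node_costs, edge_costs
```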

Now let us compute similarities without attributes/features.

Similarities between nodes always equal 1. Similarities between edges always equal 1.

Only the structure of the graph matters. The problem falls into subgraph isomorphism.

The two kinds of similarities (version 1 and version 2), built with and without feature information, represent two different problems. The objective function based on similarities version 2 may not guide the solver very well, and many symmetries may appear in the problem because everything is similar. So the solving task may be more challenging for this version.

To be able to create the Integer Linear Program, we need to create a list of edges.

Creating edges_src and edges_dst.

edges_src: a torch.tensor of shape (|E_1|, 2) describing the edges of G_1,

edges_dst: a torch.tensor of shape (|E_2|, 2) describing the edges of G_2.
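A hedged sketch of building these tensors from two NetworkX graphs, assuming nodes are first relabelled with integer indices (the helper name is a placeholder):

```python
import torch
import networkx as nx

def edge_list_tensor(G):
    """Return a (|E|, 2) tensor of integer node indices for the edges of G."""
    G = nx.convert_node_labels_to_integers(G)   # map node ids (e.g. "_0") to 0..n-1
    return torch.tensor(list(G.edges()), dtype=torch.long)

# edges_src = edge_list_tensor(G1)   # shape (|E_1|, 2)
# edges_dst = edge_list_tensor(G2)   # shape (|E_2|, 2)
```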

Graph Matching by Integer Linear Programming

Let us implement the model:
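The notebook's actual implementation lives in the following code cell. For readers without the cells, here is a hedged sketch of one way such a model could be written with the PuLP solver. The solver choice, the function name and this particular linearization (matching edge pairs only in the listed orientation, with linking constraints in both directions) are assumptions, not the notebook's exact F2 model.

```python
import pulp

def match_graphs_ilp(node_costs, edge_costs, edges_src, edges_dst):
    """Solve a linearized graph matching ILP (a sketch, not necessarily F2).

    node_costs : (n1, n2) array of node matching costs c(u_i, v_k)
    edge_costs : (|E1|, |E2|) array of edge matching costs
    edges_src, edges_dst : (|E|, 2) integer pairs (i, j) describing the edges
    Returns a dict {(i, k): 0/1} giving the node assignment.
    """
    n1, n2 = len(node_costs), len(node_costs[0])
    E1 = [(int(i), int(j)) for i, j in edges_src]
    E2 = [(int(k), int(l)) for k, l in edges_dst]

    prob = pulp.LpProblem("graph_matching", pulp.LpMinimize)
    # x[(i, k)] = 1 iff node i of G1 is matched to node k of G2
    x = pulp.LpVariable.dicts("x", [(i, k) for i in range(n1) for k in range(n2)], cat="Binary")
    # z[(e, f)] = 1 iff edge e of G1 is matched to edge f of G2
    z = pulp.LpVariable.dicts("z", [(e, f) for e in range(len(E1)) for f in range(len(E2))], cat="Binary")

    # Linear objective: first-order (node) costs + second-order (edge) costs
    prob += (pulp.lpSum(node_costs[i][k] * x[(i, k)] for i in range(n1) for k in range(n2))
             + pulp.lpSum(edge_costs[e][f] * z[(e, f)] for e in range(len(E1)) for f in range(len(E2))))

    # At most one-to-one node matching constraints
    for i in range(n1):
        prob += pulp.lpSum(x[(i, k)] for k in range(n2)) <= 1
    for k in range(n2):
        prob += pulp.lpSum(x[(i, k)] for i in range(n1)) <= 1

    # Linking constraints: z[(e, f)] = 1 exactly when both endpoints are matched accordingly
    for e, (i, j) in enumerate(E1):
        for f, (k, l) in enumerate(E2):
            prob += z[(e, f)] <= x[(i, k)]
            prob += z[(e, f)] <= x[(j, l)]
            prob += z[(e, f)] >= x[(i, k)] + x[(j, l)] - 1

    prob.solve()
    return {(i, k): int(x[(i, k)].value()) for i in range(n1) for k in range(n2)}
```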

Visualization of the matching

Let's make a function that builds the edge lists and the costs

Let us do some animation with costs based on features (costs version 1)

Let us do some animation with costs without features. Subgraph isomorphism problem is solved. (costs version 2)

Let us match two graphs without attributes.

Let us create 2 simple graphs

G1 creation

G2 creation
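The exact toy graphs are defined in the code cells. As an illustrative placeholder only (not the notebook's graphs), two small unattributed graphs could be created with NetworkX like this:

```python
import networkx as nx

# Two small illustrative graphs (placeholder examples)
G1 = nx.Graph()
G1.add_edges_from([(0, 1), (1, 2), (2, 0), (2, 3)])   # a triangle with a pendant node

G2 = nx.Graph()
G2.add_edges_from([(0, 1), (1, 2), (2, 0)])           # a triangle
```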

Let us try to match these two graphs with a heuristic method.

This heuristic is based on the leading eigenvector and the Sinkhorn-Knopp algorithm.

The notebook can be found in : "a graph matching method based on the leading Eigen vector and Sinkhorn-Knopp algorithm"

Refine the matching by the Sinkhorn-Knopp algorithm
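Putting the earlier sketches together (all function names are the placeholders defined above, and `K`, `n1`, `n2` are assumed to come from the affinity matrix construction), the heuristic could be invoked roughly as follows:

```python
import numpy as np

# K: affinity matrix built for G1 and G2 (see build_affinity_matrix above)
y_star = power_iteration(K, num_iters=100)                 # leading eigenvector (relaxed solution)
Y_star = y_star.reshape((n1, n2), order="F")               # column-wise reshape into an n1 x n2 matrix
Y_refined = sinkhorn_knopp(np.abs(Y_star), num_iters=20)   # refine towards a doubly stochastic matrix
```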