\chapter{Theoretical Background}%
\label{chapter:theoretical_background}
In this chapter, the theoretical background necessary to understand the
decoding algorithms examined in this work is given.
First, the notation used is clarified.
Then, the physical layer is detailed, namely the modulation scheme and channel model used.
A short introduction to channel coding with binary linear codes and especially
\ac{LDPC} codes is given.
The established methods of decoding \ac{LDPC} codes are briefly explained.
Lastly, the general process of decoding using optimization techniques is described
and an overview of the utilized optimization methods is given.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General Remarks on Notation}
\label{sec:theo:Notation}
Wherever the domain of a variable is expanded, this will be indicated with a tilde.
For example:%
%
\begin{align*}
x \in \left\{ -1, 1 \right\} &\to \tilde{x} \in \mathbb{R}\\
c \in \mathbb{F}_2 &\to \tilde{c} \in \left[ 0, 1 \right] \subseteq \mathbb{R}
.\end{align*}
%
Additionally, a shorthand notation will be used, denoting a set of indices as%
%
\begin{align*}
\left[ m:n \right] &:= \left\{ m, m+1, \ldots, n-1, n \right\},
\hspace{5mm} m < n, \hspace{2mm} m,n\in\mathbb{Z}
.\end{align*}
%
In order to designate element-wise operations, in particular the \textit{Hadamard product}
and the \textit{Hadamard power}, the operator $\circ$ will be used:%
%
\begin{alignat*}{3}
\boldsymbol{a} \circ \boldsymbol{b}
&:= \begin{bmatrix} a_1 b_1 & \ldots & a_n b_n \end{bmatrix} ^\text{T},
\hspace{5mm} &&\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n, \hspace{2mm} n\in \mathbb{N} \\
\boldsymbol{a}^{\circ k} &:= \begin{bmatrix} a_1^k & \ldots & a_n^k \end{bmatrix}^\text{T},
\hspace{5mm} &&\boldsymbol{a} \in \mathbb{R}^n, \hspace{2mm}n\in \mathbb{N}, k\in \mathbb{Z}
.\end{alignat*}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Channel Model and Modulation}
\label{sec:theo:Preliminaries: Channel Model and Modulation}
In order to transmit a bit-word $\boldsymbol{c} \in \mathbb{F}_2^n$ of length
$n$ over a channel, it has to be mapped onto a vector of symbols
$\boldsymbol{x} \in \mathbb{R}^n$ that can be physically transmitted.
This is known as modulation. The modulation scheme chosen here is \ac{BPSK}:%
%
\begin{align*}
\boldsymbol{x} = 1 - 2\boldsymbol{c}
.\end{align*}
%
The transmitted symbol is distorted by the channel and denoted as
$\boldsymbol{y} \in \mathbb{R}^n$.
This distortion is described by the channel model, which in the context of
this thesis is chosen to be \ac{AWGN}:%
%
\begin{align*}
\boldsymbol{y} = \boldsymbol{x} + \boldsymbol{n},
\hspace{5mm} n_i \sim \mathcal{N}\left( 0, \frac{\sigma^2}{2} \right),
\hspace{2mm} i \in \left[ 1:n \right]
.\end{align*}
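%
To make the modulation and channel model concrete, the following Python sketch
(assuming NumPy) simulates the transmission of a random bit-word over the
\ac{AWGN} channel using \ac{BPSK}; the block length and noise parameter are
purely illustrative and not taken from the simulations in this work.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 8                                # block length (illustrative)
sigma = 0.8                          # noise parameter sigma (illustrative)

c = rng.integers(0, 2, size=n)       # bit-word c in F_2^n
x = 1 - 2 * c                        # BPSK mapping: 0 -> +1, 1 -> -1
noise = rng.normal(0.0, np.sqrt(sigma**2 / 2), size=n)  # n_i ~ N(0, sigma^2/2)
y = x + noise                        # received (distorted) vector y

print(c, y.round(2))
\end{verbatim}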
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Channel Coding with LDPC Codes}
\label{sec:theo:Channel Coding with LDPC Codes}
Channel coding describes the process of adding redundancy to information
transmitted over a channel in order to detect and correct any errors
that may occur during the transmission.
Encoding the information using \textit{binary linear codes} is one way of
conducting this process, whereby \textit{data words} are mapped onto longer
\textit{codewords}, which carry redundant information.
\Ac{LDPC} codes have become especially popular, since they are able to
reach arbitrarily small probabilities of error at code rates up to the capacity
of the channel \cite[Sec. II.B.]{mackay_rediscovery}, while having a structure
that allows for very efficient decoding.
The lengths of the data words and codewords are denoted by $k\in\mathbb{N}$
and $n\in\mathbb{N}$, respectively, with $k \le n$.
The set of codewords $\mathcal{C} \subset \mathbb{F}_2^n$ of a binary
linear code can be represented using the \textit{parity-check matrix}
$\boldsymbol{H} \in \mathbb{F}_2^{m\times n}$, where $m$ represents
the number of parity-checks:%
%
\begin{align*}
\mathcal{C} := \left\{ \boldsymbol{c} \in \mathbb{F}_2^n :
\boldsymbol{H}\boldsymbol{c}^\text{T} = \boldsymbol{0} \right\}
.\end{align*}
%
A data word $\boldsymbol{u} \in \mathbb{F}_2^k$ can be mapped onto a codeword
$\boldsymbol{c} \in \mathbb{F}_2^n$ using the \textit{generator matrix}
$\boldsymbol{G} \in \mathbb{F}_2^{k\times n}$:%
%
\begin{align*}
\boldsymbol{c} = \boldsymbol{u}\boldsymbol{G}
.\end{align*}
%
After obtaining a codeword from a data word, it is transmitted over a channel
as described in section \ref{sec:theo:Preliminaries: Channel Model and Modulation}.
The received signal $\boldsymbol{y}$ is then decoded to obtain
an estimate of the transmitted codeword, denoted as $\hat{\boldsymbol{c}}$.
Finally, the encoding procedure is reversed and an estimate of the originally
sent data word, $\hat{\boldsymbol{u}}$, is produced.
The methods examined in this work are all based on \textit{soft-decision} decoding,
i.e., $\boldsymbol{y}$ is considered to be in $\mathbb{R}^n$ and no preliminary decision
is made by a demodulator.
The process of transmitting and decoding a codeword is visualized in
figure \ref{fig:theo:channel_overview}.%
%
\begin{figure}[H]
\centering
\tikzstyle{box} = [rectangle, minimum width=1.5cm, minimum height=0.7cm,
rounded corners=0.1cm, text centered, draw=black, fill=KITgreen!80]
\begin{tikzpicture}[scale=1, transform shape]
\node (c) {$\boldsymbol{c}$};
\node[box, right=0.5cm of c] (bpskmap) {Mapper};
\node[right=1.5cm of bpskmap,
draw, circle, inner sep=0pt, minimum size=0.5cm] (add) {$+$};
\node[box, right=1.5cm of add] (decoder) {Decoder};
\node[box, right=1.5cm of decoder] (demapper) {Demapper};
\node[right=0.5cm of demapper] (out) {$\boldsymbol{\hat{c}}$};
\node (x) at ($(bpskmap.east)!0.5!(add.west) + (0,0.3cm)$) {$\boldsymbol{x}$};
\node (y) at ($(add.east)!0.5!(decoder.west) + (0,0.3cm)$) {$\boldsymbol{y}$};
\node (x_hat) at ($(decoder.east)!0.5!(demapper.west) + (0,0.3cm)$)
{$\boldsymbol{\hat{x}}$};
\node[below=0.5cm of add] (n) {$\boldsymbol{n}$};
\draw[->] (c) -- (bpskmap);
\draw[->] (bpskmap) -- (add);
\draw[->] (add) -- (decoder);
\draw[->] (n) -- (add);
\draw[->] (decoder) -- (demapper);
\draw[->] (demapper) -- (out);
\coordinate (top_left) at ($(x.north west) + (-0.1cm, 0.1cm)$);
\coordinate (top_right) at ($(y.north east) + (+0.1cm, 0.1cm)$);
\coordinate (bottom_center) at ($(n.south) + (0cm, -0.1cm)$);
\draw[dashed] (top_left) -- (top_right) |- (bottom_center) -| cycle;
\node[below=0.25cm of n] (text) {Channel};
\end{tikzpicture}
\caption{Overview of channel model and modulation}
\label{fig:theo:channel_overview}
\end{figure}
The decoding process itself is generally based either on the \ac{MAP} or the \ac{ML}
criterion:%
%
\begin{align*}
\hat{\boldsymbol{c}}_{\text{\ac{MAP}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
p_{\boldsymbol{C} \mid \boldsymbol{Y}} \left(\boldsymbol{c} \mid \boldsymbol{y}
\right) \\
\hat{\boldsymbol{c}}_{\text{\ac{ML}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
f_{\boldsymbol{Y} \mid \boldsymbol{C}} \left( \boldsymbol{y} \mid \boldsymbol{c}
\right)
.\end{align*}%
%
The two criteria are closely connected through Bayes' theorem and are equivalent
when the prior probability of transmitting a codeword is the same for all
codewords, since then $p_{\boldsymbol{C}}\left( \boldsymbol{c} \right)$ is constant
and $f_{\boldsymbol{Y}}\left( \boldsymbol{y} \right)$ does not depend on
$\boldsymbol{c}$:
%
\begin{align*}
\argmax_{\boldsymbol{c}\in\mathcal{C}} p_{\boldsymbol{C} \mid \boldsymbol{Y}}
\left( \boldsymbol{c} \mid \boldsymbol{y} \right)
&= \argmax_{\boldsymbol{c}\in\mathcal{C}} \frac{f_{\boldsymbol{Y} \mid \boldsymbol{C}}
\left( \boldsymbol{y} \mid \boldsymbol{c} \right) p_{\boldsymbol{C}}
\left( \boldsymbol{c} \right)}{f_{\boldsymbol{Y}}\left( \boldsymbol{y} \right) } \\
&= \argmax_{\boldsymbol{c}\in\mathcal{C}} f_{\boldsymbol{Y} \mid \boldsymbol{C}}
\left( \boldsymbol{y} \mid \boldsymbol{c} \right) p_{\boldsymbol{C}}
\left( \boldsymbol{c} \right) \\
&= \argmax_{\boldsymbol{c}\in\mathcal{C}}f_{\boldsymbol{Y} \mid \boldsymbol{C}}
\left( \boldsymbol{y} \mid \boldsymbol{c} \right)
.\end{align*}
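%
For \ac{BPSK} transmission over the \ac{AWGN} channel, maximizing
$f_{\boldsymbol{Y} \mid \boldsymbol{C}}\left( \boldsymbol{y} \mid \boldsymbol{c} \right)$
is equivalent to choosing the codeword whose \ac{BPSK} image lies closest to
$\boldsymbol{y}$ in Euclidean distance.
The following Python sketch (assuming NumPy) illustrates this with a purely
hypothetical, very small parity-check matrix: it enumerates the set of codewords
by brute force and decodes by exhaustive search.
This is only feasible for very short codes and is not one of the decoders
examined in this work.
\begin{verbatim}
import numpy as np
from itertools import product

# Purely illustrative parity-check matrix (not a code used in this work)
H = np.array([[1, 1, 0, 1, 0],
              [0, 1, 1, 0, 1]])
n = H.shape[1]

# Enumerate the codebook C = {c in F_2^n : H c^T = 0 (mod 2)} by brute force
codebook = np.array([c for c in product([0, 1], repeat=n)
                     if not np.any(H @ np.array(c) % 2)])
x_book = 1 - 2 * codebook            # BPSK images of all codewords

def ml_decode(y):
    # For BPSK over AWGN, maximizing f(y|c) amounts to picking the codeword
    # whose BPSK image is closest to y in Euclidean distance.
    return codebook[np.argmin(np.sum((y - x_book) ** 2, axis=1))]

y = np.array([0.9, -1.2, 0.3, -0.8, 1.1])   # an arbitrary received vector
print(ml_decode(y))
\end{verbatim}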
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Tanner Graphs and Belief Propagation}
\label{sec:theo:Tanner Graphs and Belief Propagation}
It is often helpful to visualize codes graphically.
This is especially true for \ac{LDPC} codes, as the established decoding
algorithms are \textit{message passing algorithms}, which are inherently
graph-based.
A binary linear code with a parity-check matrix $\boldsymbol{H}$ can be
visualized using a \textit{Tanner} or \textit{factor graph}:
Each row of $\boldsymbol{H}$, which represents one parity-check, is viewed as a
\ac{CN}.
Each component of the codeword $\boldsymbol{c}$ is interpreted as a \ac{VN}.
The relationship between \acp{CN} and \acp{VN} can then be plotted by noting
which components of $\boldsymbol{c}$ are considered for which parity-check.
Figure \ref{fig:theo:tanner_graph} shows the Tanner graph for the
(7,4) Hamming code, which has the following parity-check matrix
\cite[Example 5.7.]{ryan_lin_2009}:%
%
\begin{align*}
\boldsymbol{H} = \begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
.\end{align*}
%
%
\begin{figure}[H]
\centering
\tikzstyle{checknode} = [color=KITblue, fill=KITblue,
draw, regular polygon,regular polygon sides=4,
inner sep=0pt, minimum size=12pt]
\tikzstyle{variablenode} = [color=KITgreen, fill=KITgreen,
draw, circle, inner sep=0pt, minimum size=10pt]
\begin{tikzpicture}[scale=1, transform shape]
\node[checknode,
label={[below, label distance=-0.4cm, align=center]
\acs{CN} 1\\$\left( c_1 + c_3 + c_5 + c_7 = 0 \right) $}]
(cn1) at (-4, -1) {};
\node[checknode,
label={[below, label distance=-0.4cm, align=center]
\acs{CN} 2\\$\left( c_2 + c_3 + c_6 + c_7 = 0 \right) $}]
(cn2) at (0, -1) {};
\node[checknode,
label={[below, label distance=-0.4cm, align=center]
\acs{CN} 3\\$\left( c_4 + c_5 + c_6 + c_7 = 0 \right) $}]
(cn3) at (4, -1) {};
\node[variablenode, label={[above, align=center] \acs{VN} 1\\$c_1$}] (c1) at (-4.5, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 2\\$c_2$}] (c2) at (-3, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 3\\$c_3$}] (c3) at (-1.5, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 4\\$c_4$}] (c4) at (0, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 5\\$c_5$}] (c5) at (1.5, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 6\\$c_6$}] (c6) at (3, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN} 7\\$c_7$}] (c7) at (4.5, 2) {};
\draw (cn1) -- (c1);
\draw (cn1) -- (c3);
\draw (cn1) -- (c5);
\draw (cn1) -- (c7);
\draw (cn2) -- (c2);
\draw (cn2) -- (c3);
\draw (cn2) -- (c6);
\draw (cn2) -- (c7);
\draw (cn3) -- (c4);
\draw (cn3) -- (c5);
\draw (cn3) -- (c6);
\draw (cn3) -- (c7);
\end{tikzpicture}
\caption{Tanner graph for the (7,4) Hamming code}
\label{fig:theo:tanner_graph}
\end{figure}%
%
\noindent \acp{CN} and \acp{VN}, and by extension the rows and columns of
$\boldsymbol{H}$, are indexed with the variables $j$ and $i$.
The sets of all \acp{CN} and all \acp{VN} are denoted by
$\mathcal{J} := \left[ 1:m \right]$ and $\mathcal{I} := \left[ 1:n \right]$, respectively.
The \textit{neighborhood} of the $j$th \ac{CN}, i.e., the set of all adjacent \acp{VN},
is denoted by $N_c\left( j \right)$.
The neighborhood of the $i$th \ac{VN} is denoted by $N_v\left( i \right)$.
For the code depicted in figure \ref{fig:theo:tanner_graph}, for example,
$N_c\left( 1 \right) = \left\{ 1, 3, 5, 7 \right\}$ and
$N_v\left( 3 \right) = \left\{ 1, 2 \right\}$.
The degree $d_j$ of a \ac{CN} is defined as the number of adjacent \acp{VN}:
$d_j := \left| N_c\left( j \right) \right| $; the degree of a \ac{VN} is
similarly defined as $d_i := \left| N_v\left( i \right) \right|$.
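As a concrete illustration, the neighborhoods and degrees can be read directly
from the rows and columns of $\boldsymbol{H}$.
The following Python sketch (assuming NumPy) computes them for the parity-check
matrix of the (7,4) Hamming code given above.
\begin{verbatim}
import numpy as np

# Parity-check matrix of the (7,4) Hamming code from above
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
m, n = H.shape

# Neighborhoods (1-based indices, matching the notation in the text)
N_c = {j + 1: [i + 1 for i in range(n) if H[j, i]] for j in range(m)}
N_v = {i + 1: [j + 1 for j in range(m) if H[j, i]] for i in range(n)}

print(N_c[1], N_v[3])               # [1, 3, 5, 7] and [1, 2]
print(len(N_c[1]), len(N_v[3]))     # degrees d_1 = 4 (CN) and d_3 = 2 (VN)
\end{verbatim}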
Message passing algorithms are based on the notion of passing messages between
\acp{CN} and \acp{VN}.
\Ac{BP} is one such algorithm that is commonly used to decode \ac{LDPC} codes.
It aims to compute the posterior probabilities
$p_{C_i \mid \boldsymbol{Y}}\left(c_i = 1 \mid \boldsymbol{y} \right),\hspace{2mm} i\in\mathcal{I}$
\cite[Sec. III.]{mackay_rediscovery}, which are then used to calculate the estimate
$\hat{\boldsymbol{c}}$.
For cycle-free graphs this goal is reached after a finite
number of steps and \ac{BP} is equivalent to \ac{MAP} decoding.
When the graph contains cycles, however, \ac{BP} only approximates the \ac{MAP} probabilities
and is sub-optimal.
This leads to generally worse performance than \ac{MAP} decoding for practical codes.
Additionally, an \textit{error floor} appears for very high \acp{SNR}, making
the use of \ac{BP} impractical for applications where a very low error rate is
desired \cite[Sec. 15.3]{ryan_lin_2009}.
Another popular decoding method for \ac{LDPC} codes is the
\textit{min-sum algorithm},
a simplification of \ac{BP} in which the non-linear $\tanh$-based
check node update is approximated in order to reduce the computational
complexity.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Decoding using Optimization Methods}%
\label{sec:theo:Decoding using Optimization Methods}
%
% General methodology
%
The general idea behind using optimization methods for channel decoding
is to reformulate the decoding problem as an optimization problem.
This new formulation can then be solved with one of the many
available optimization algorithms.
Generally, the original decoding problem considered is either the \ac{MAP} or
the \ac{ML} decoding problem:%
%
\begin{align}
\hat{\boldsymbol{c}}_{\text{\ac{MAP}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
p_{\boldsymbol{C} \mid \boldsymbol{Y}} \left(\boldsymbol{c} \mid \boldsymbol{y}
\right) \label{eq:dec:map}\\
\hat{\boldsymbol{c}}_{\text{\ac{ML}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
f_{\boldsymbol{Y} \mid \boldsymbol{C}} \left( \boldsymbol{y} \mid \boldsymbol{c}
\right) \label{eq:dec:ml}
.\end{align}%
%
The goal is to arrive at a formulation in which an objective function
$g : \mathbb{R}^n \rightarrow \mathbb{R} $ is minimized subject to certain constraints:%
%
\begin{align*}
\text{minimize}\hspace{2mm} &g\left( \tilde{\boldsymbol{c}} \right)\\
\text{subject to}\hspace{2mm} &\tilde{\boldsymbol{c}} \in D
,\end{align*}%
%
where $D \subseteq \mathbb{R}^n$ is the domain of values attainable for $\tilde{\boldsymbol{c}}$
and represents the constraints under which the minimization is to take place.
In contrast to the established message-passing decoding algorithms,
the perspective then changes from observing the decoding process in its
Tanner graph representation with \acp{VN} and \acp{CN} (as shown in figure \ref{fig:dec:tanner})
to a spatial representation (figure \ref{fig:dec:spatial}),
where the codewords are some of the vertices of a hypercube.
The goal is to find the point $\tilde{\boldsymbol{c}}$
that minimizes the objective function $g$.
%
% Figure showing decoding space
%
\begin{figure}[H]
\centering
\begin{subfigure}[c]{0.47\textwidth}
\centering
\tikzstyle{checknode} = [color=KITblue, fill=KITblue,
draw, regular polygon,regular polygon sides=4,
inner sep=0pt, minimum size=12pt]
\tikzstyle{variablenode} = [color=KITgreen, fill=KITgreen,
draw, circle, inner sep=0pt, minimum size=10pt]
\begin{tikzpicture}[scale=1, transform shape]
\node[checknode,
label={[below, label distance=-0.4cm, align=center]
\acs{CN}\\$\left( c_1 + c_2 + c_3 = 0 \right) $}]
(cn) at (0, 0) {};
\node[variablenode, label={[above, align=center] \acs{VN}\\$\left( c_1 \right)$}]
(c1) at (-2, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN}\\$\left( c_2 \right)$}]
(c2) at (0, 2) {};
\node[variablenode, label={[above, align=center] \acs{VN}\\$\left( c_3 \right)$}]
(c3) at (2, 2) {};
\draw (cn) -- (c1);
\draw (cn) -- (c2);
\draw (cn) -- (c3);
\end{tikzpicture}
\caption{Tanner graph representation of a single parity-check code}
\label{fig:dec:tanner}
\end{subfigure}%
\hfill%
\begin{subfigure}[c]{0.47\textwidth}
\centering
\tikzstyle{codeword} = [color=KITblue, fill=KITblue,
draw, circle, inner sep=0pt, minimum size=4pt]
\tdplotsetmaincoords{60}{25}
\begin{tikzpicture}[scale=1, transform shape, tdplot_main_coords]
% Cube
\coordinate (p000) at (0, 0, 0);
\coordinate (p001) at (0, 0, 2);
\coordinate (p010) at (0, 2, 0);
\coordinate (p011) at (0, 2, 2);
\coordinate (p100) at (2, 0, 0);
\coordinate (p101) at (2, 0, 2);
\coordinate (p110) at (2, 2, 0);
\coordinate (p111) at (2, 2, 2);
\draw[] (p000) -- (p100);
\draw[] (p100) -- (p101);
\draw[] (p101) -- (p001);
\draw[] (p001) -- (p000);
\draw[dashed] (p010) -- (p110);
\draw[] (p110) -- (p111);
\draw[] (p111) -- (p011);
\draw[dashed] (p011) -- (p010);
\draw[dashed] (p000) -- (p010);
\draw[] (p100) -- (p110);
\draw[] (p101) -- (p111);
\draw[] (p001) -- (p011);
% Polytope Vertices
\node[codeword] (c000) at (p000) {};
\node[codeword] (c101) at (p101) {};
\node[codeword] (c110) at (p110) {};
\node[codeword] (c011) at (p011) {};
% Polytope Edges
% \draw[line width=1pt, color=KITblue] (c000) -- (c101);
% \draw[line width=1pt, color=KITblue] (c000) -- (c110);
% \draw[line width=1pt, color=KITblue] (c000) -- (c011);
%
% \draw[line width=1pt, color=KITblue] (c101) -- (c110);
% \draw[line width=1pt, color=KITblue] (c101) -- (c011);
%
% \draw[line width=1pt, color=KITblue] (c011) -- (c110);
% Polytope Annotations
\node[color=KITblue, below=0cm of c000] {$\left( 0, 0, 0 \right) $};
\node[color=KITblue, right=0.17cm of c101] {$\left( 1, 0, 1 \right) $};
\node[color=KITblue, right=0cm of c110] {$\left( 1, 1, 0 \right) $};
\node[color=KITblue, above=0cm of c011] {$\left( 0, 1, 1 \right) $};
% c
\node[color=KITgreen, fill=KITgreen,
draw, circle, inner sep=0pt, minimum size=4pt] (c) at (0.9, 0.7, 1) {};
\node[color=KITgreen, right=0cm of c] {$\tilde{\boldsymbol{c}}$};
\end{tikzpicture}
\caption{Spatial representation of a single parity-check code}
\label{fig:dec:spatial}
\end{subfigure}%
\caption{Different representations of the decoding problem}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{A Short Introduction to the Proximal Gradient Method and ADMM}
\label{sec:theo:Optimization Methods}
In this section, the general ideas behind the optimization methods used in
this work are outlined.
The application of these optimization methods to channel decoding
will be discussed in later chapters.
Two methods are introduced: the \textit{proximal gradient method} and
\ac{ADMM}.
\textit{Proximal algorithms} are algorithms for solving convex optimization
problems that rely on the use of \textit{proximal operators}.
The proximal operator $\textbf{prox}_{\lambda f} : \mathbb{R}^n \rightarrow \mathbb{R}^n$
of a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is defined by
\cite[Sec. 1.1]{proximal_algorithms}%
%
\begin{align*}
\textbf{prox}_{\lambda f}\left( \boldsymbol{v} \right)
= \argmin_{\boldsymbol{x} \in \mathbb{R}^n} \left(
f\left( \boldsymbol{x} \right) + \frac{1}{2\lambda}\lVert \boldsymbol{x}
- \boldsymbol{v} \rVert_2^2 \right)
.\end{align*}
%
This operator computes a point that is a compromise between minimizing $f$
and staying in the proximity of $\boldsymbol{v}$.
The parameter $\lambda > 0$ determines how the two terms are weighted.
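As an example with a closed-form solution, consider
$f\left( \boldsymbol{x} \right) = \lVert \boldsymbol{x} \rVert_1$, whose proximal
operator is the well-known soft-thresholding operation.
The following Python sketch (assuming NumPy) evaluates it and verifies the
defining minimization numerically on a one-dimensional grid; the specific
values are arbitrary.
\begin{verbatim}
import numpy as np

def prox_l1(v, lam):
    # Soft-thresholding: the proximal operator of f(x) = ||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# One-dimensional check: the prox point should minimize
# |x| + (x - v)^2 / (2*lam) -- compare against a fine grid search.
v, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 100001)
objective = np.abs(grid) + (grid - v) ** 2 / (2 * lam)
print(prox_l1(np.array([v]), lam))       # [0.8]
print(grid[np.argmin(objective)])        # approx. 0.8
\end{verbatim}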
The proximal gradient method is an iterative optimization method
utilizing proximal operators.
It is used to solve problems of the form%
%
\begin{align*}
\underset{\boldsymbol{x} \in \mathbb{R}^n}{\text{minimize}}\hspace{5mm}
f\left( \boldsymbol{x} \right) + g\left( \boldsymbol{x} \right),
\end{align*}
%
where $f$ is assumed to be differentiable.
Each iteration consists of two steps: minimizing $f$ with a gradient descent step
and minimizing $g$ using its proximal operator
\cite[Sec. 4.2]{proximal_algorithms}:%
%
\begin{align*}
\boldsymbol{x} &\leftarrow \boldsymbol{x} - \lambda \nabla f\left( \boldsymbol{x} \right) \\
\boldsymbol{x} &\leftarrow \textbf{prox}_{\lambda g} \left( \boldsymbol{x} \right)
.\end{align*}
%
Since $g$ is minimized with the proximal operator and is thus not required
to be differentiable, it can be used to encode the constraints of the optimization problem
(e.g., in the form of an \textit{indicator function}, as mentioned in
\cite[Sec. 1.2]{proximal_algorithms}).
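To make the two steps concrete, the following Python sketch (assuming NumPy)
applies the proximal gradient method to a small sparse least-squares problem,
where $f$ is a quadratic data-fidelity term and $g$ a weighted $\ell_1$ norm.
This example is unrelated to the decoding problem and only uses the ingredients
introduced above; the proximal operator of $g$ is the soft-thresholding
operation from the previous example, and the problem data, step size and
iteration count are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -1.5]             # sparse ground truth
b = A @ x_true + 0.05 * rng.standard_normal(30)

alpha = 0.5                              # weight of the l1 term in g
lam = 1.0 / np.linalg.norm(A, 2) ** 2    # step size: 1 / Lipschitz const. of grad f

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(10)
for _ in range(500):
    x = x - lam * (A.T @ (A @ x - b))    # gradient step on f(x) = 0.5*||Ax - b||^2
    x = prox_l1(x, lam * alpha)          # proximal step on g(x) = alpha*||x||_1
print(np.round(x, 2))                    # close to x_true, up to shrinkage
\end{verbatim}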
\ac{ADMM} is another optimization method.
In this thesis it will be used to solve a \textit{linear program}, which
is a special type of convex optimization problem in which the objective function
is linear and the constraints consist of linear equalities and inequalities.
Generally, any linear program can be expressed in \textit{standard form}%
\footnote{The inequality $\boldsymbol{x} \ge \boldsymbol{0}$ is to be
interpreted componentwise.}
\cite[Sec. 1.1]{intro_to_lin_opt_book}:%
%
\begin{alignat}{3}
\begin{alignedat}{3}
\underset{\boldsymbol{x}\in\mathbb{R}^n}{\text{minimize }}\hspace{2mm}
&& \boldsymbol{\gamma}^\text{T} \boldsymbol{x} \\
\text{subject to }\hspace{2mm} && \boldsymbol{A}\boldsymbol{x} & = \boldsymbol{b} \\
&& \boldsymbol{x} & \ge \boldsymbol{0},
\end{alignedat}
\label{eq:theo:admm_standard}
\end{alignat}%
%
where $\boldsymbol{x}, \boldsymbol{\gamma} \in \mathbb{R}^n$, $\boldsymbol{b} \in \mathbb{R}^m$
and $\boldsymbol{A}\in\mathbb{R}^{m \times n}$.
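As a brief sanity check that a problem in this form can be handled by a generic
solver, the following Python sketch solves a tiny standard-form linear program
with SciPy's \texttt{linprog}; the problem data are arbitrary and only serve as
an illustration.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

# minimize gamma^T x  subject to  A x = b,  x >= 0   (arbitrary toy data)
gamma = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

# linprog's default variable bounds are (0, None), i.e. x >= 0
res = linprog(c=gamma, A_eq=A, b_eq=b)
print(res.x, res.fun)                    # x = (0, 0, 1), optimal objective 0
\end{verbatim}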
A technique called \textit{Lagrangian relaxation} can then be applied
\cite[Sec. 11.4]{intro_to_lin_opt_book}.
First, the equality constraints are moved into the objective function itself
and weights $\boldsymbol{\lambda}$ are introduced. A new, relaxed problem
is formulated as
%
\begin{align}
\begin{aligned}
\underset{\boldsymbol{x}\in\mathbb{R}^n}{\text{minimize }}\hspace{2mm}
& \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
+ \boldsymbol{\lambda}^\text{T}\left(
\boldsymbol{A}\boldsymbol{x} - \boldsymbol{b}\right) \\
\text{subject to }\hspace{2mm} & \boldsymbol{x} \ge \boldsymbol{0},
\end{aligned}
\label{eq:theo:admm_relaxed}
\end{align}%
%
the new objective function being the \textit{Lagrangian}%
\footnote{
Depending on what literature is consulted, the definition of the Lagrangian differs
in the order of $\boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{b}$.
As will subsequently be seen, however, the only property of the Lagrangian having
any bearing on the optimization process is that minimizing it gives a lower bound
on the optimal objective of the original problem.
This property is satisfied no matter the order of the terms and the order
chosen here is the one used in the \ac{LP} decoding literature making use of
\ac{ADMM}.
}%
%
\begin{align*}
\mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
= \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
+ \boldsymbol{\lambda}^\text{T}\left(
\boldsymbol{A}\boldsymbol{x} - \boldsymbol{b}\right)
.\end{align*}%
%
This problem is not directly equivalent to the original one, as the
solution now depends on the choice of the \textit{Lagrange multipliers}
$\boldsymbol{\lambda}$.
Interestingly, however, for any choice of the weights $\boldsymbol{\lambda}$,
the minimum of the objective function (hereafter called the \textit{optimal objective})
of the relaxed problem (\ref{eq:theo:admm_relaxed}) is a lower bound for
the optimal objective of the original problem (\ref{eq:theo:admm_standard})
\cite[Sec. 4.1]{intro_to_lin_opt_book}:%
%
\begin{align*}
\min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \phantom{a}}}
\mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda}
\right)
\le
\min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \boldsymbol{A}\boldsymbol{x}
= \boldsymbol{b}}}
\boldsymbol{\gamma}^\text{T}\boldsymbol{x}
.\end{align*}
%
Furthermore, for uniquely solvable linear programs, \textit{strong duality}
always holds \cite[Theorem 4.4]{intro_to_lin_opt_book}.
This means that the tightest lower bound is not merely a bound but actually
attains the optimal objective itself:
with the optimal choice of $\boldsymbol{\lambda}$,
the optimal objectives of the problems (\ref{eq:theo:admm_relaxed})
and (\ref{eq:theo:admm_standard}) have the same value, i.e.,
%
\begin{align*}
\max_{\boldsymbol{\lambda}\in\mathbb{R}^m} \, \min_{\boldsymbol{x} \ge \boldsymbol{0}}
\mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
= \min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \boldsymbol{A}\boldsymbol{x}
= \boldsymbol{b}}}
\boldsymbol{\gamma}^\text{T}\boldsymbol{x}
.\end{align*}
%
Thus, we can define the \textit{dual problem} as the search for the tightest lower bound:%
%
\begin{align}
\underset{\boldsymbol{\lambda}\in\mathbb{R}^m}{\text{maximize }}\hspace{2mm}
& \min_{\boldsymbol{x} \ge \boldsymbol{0}} \mathcal{L}
\left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
\label{eq:theo:dual}
,\end{align}
%
and recover the solution $\boldsymbol{x}_{\text{opt}}$ to problem (\ref{eq:theo:admm_standard})
from the solution $\boldsymbol{\lambda}_\text{opt}$ to problem (\ref{eq:theo:dual})
by computing \cite[Sec. 2.1]{distr_opt_book}%
%
\begin{align}
\boldsymbol{x}_{\text{opt}} = \argmin_{\boldsymbol{x} \ge \boldsymbol{0}}
\mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda}_{\text{opt}} \right)
\label{eq:theo:admm_obtain_primal}
.\end{align}
%
The dual problem can then be solved iteratively using \textit{dual ascent}: starting with an
initial estimate for $\boldsymbol{\lambda}$, an estimate for $\boldsymbol{x}$ is calculated
using equation (\ref{eq:theo:admm_obtain_primal}); then, $\boldsymbol{\lambda}$ is updated
with a gradient ascent step on the dual objective \cite[Sec. 2.1]{distr_opt_book}:%
%
\begin{align*}
\boldsymbol{x} &\leftarrow \argmin_{\boldsymbol{x} \ge \boldsymbol{0}} \mathcal{L}\left(
\boldsymbol{x}, \boldsymbol{\lambda} \right) \\
\boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
+ \alpha\left( \boldsymbol{A}\boldsymbol{x} - \boldsymbol{b} \right),
\hspace{5mm} \alpha > 0
.\end{align*}
%
The algorithm can be improved by observing that when the objective function
$g: \mathbb{R}^n \rightarrow \mathbb{R}$ is separable into a sum of
$N \in \mathbb{N}$ sub-functions
$g_i: \mathbb{R}^{n_i} \rightarrow \mathbb{R}$,
i.e., $g\left( \boldsymbol{x} \right) = \sum_{i=1}^{N} g_i
\left( \boldsymbol{x}_i \right)$,
where $\boldsymbol{x}_i\in\mathbb{R}^{n_i},\hspace{1mm} i\in [1:N]$ are subvectors of
$\boldsymbol{x}$, the Lagrangian is as well:
%
\begin{align*}
\text{minimize }\hspace{5mm} & \sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right) \\
\text{subject to}\hspace{5mm} & \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
= \boldsymbol{b}
\end{align*}
\begin{align*}
\mathcal{L}\left( \left( \boldsymbol{x}_i \right)_{i=1}^N, \boldsymbol{\lambda} \right)
= \sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right)
+ \boldsymbol{\lambda}^\text{T} \left(
\sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i - \boldsymbol{b}\right)
.\end{align*}%
%
The matrices $\boldsymbol{A}_i \in \mathbb{R}^{m \times n_i}, \hspace{1mm} i \in [1:N]$
form a partition of $\boldsymbol{A}$, corresponding to
$\boldsymbol{A} = \begin{bmatrix}
\boldsymbol{A}_1 &
\ldots &
\boldsymbol{A}_N
\end{bmatrix}$.
The minimization of each term can happen in parallel, in a distributed
fashion \cite[Sec. 2.2]{distr_opt_book}.
In each minimization step, only one subvector $\boldsymbol{x}_i$ of
$\boldsymbol{x}$ is considered, while all other subvectors are treated as
constant.
This modified version of dual ascent is called \textit{dual decomposition}:
%
\begin{align*}
\boldsymbol{x}_i &\leftarrow \argmin_{\boldsymbol{x}_i \ge \boldsymbol{0}}\mathcal{L}\left(
\left( \boldsymbol{x}_i \right)_{i=1}^N, \boldsymbol{\lambda}\right)
\hspace{5mm} \forall i \in [1:N]\\
\boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
+ \alpha\left( \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
- \boldsymbol{b} \right),
\hspace{5mm} \alpha > 0
.\end{align*}
%
\ac{ADMM} works in the same way as dual decomposition,
differing only in the use of an \textit{augmented Lagrangian}
$\mathcal{L}_\mu\left( \left( \boldsymbol{x}_i \right)_{i=1}^N, \boldsymbol{\lambda} \right)$,
which strengthens the convergence properties.
The augmented Lagrangian extends the classical one with an additional penalty term
with the penalty parameter $\mu$:
%
\begin{align*}
\mathcal{L}_\mu \left( \left( \boldsymbol{x}_i \right)_{i=1}^N, \boldsymbol{\lambda} \right)
= \underbrace{\sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right)
+ \boldsymbol{\lambda}^\text{T}\left(\sum_{i=1}^{N}
\boldsymbol{A}_i\boldsymbol{x}_i - \boldsymbol{b}\right)}
_{\text{Classical Lagrangian}}
+ \underbrace{\frac{\mu}{2}\left\Vert \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
- \boldsymbol{b} \right\Vert_2^2}_{\text{Penalty term}},
\hspace{5mm} \mu > 0
.\end{align*}
%
The steps to solve the problem are the same as with dual decomposition, with the added
condition that the step size be $\mu$:%
%
\begin{align*}
\boldsymbol{x}_i &\leftarrow \argmin_{\boldsymbol{x}_i \ge \boldsymbol{0}}\mathcal{L}_\mu\left(
\left( \boldsymbol{x}_i \right)_{i=1}^N, \boldsymbol{\lambda}\right)
\hspace{5mm} \forall i \in [1:N]\\
\boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
+ \mu\left( \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
- \boldsymbol{b} \right),
\hspace{5mm} \mu > 0
.\end{align*}
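%
The following Python sketch (assuming NumPy and SciPy) runs these updates on
the same tiny standard-form linear program as before, with a single block
($N = 1$) for simplicity.
The per-iteration minimization over $\boldsymbol{x} \ge \boldsymbol{0}$ is
carried out numerically; this only illustrates the update rules and is not the
decoder discussed in chapter \ref{chapter:lp_dec_using_admm}.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Same toy LP in standard form: minimize gamma^T x  s.t.  A x = b,  x >= 0
gamma = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

mu = 1.0                                 # penalty parameter / step size
lam = np.zeros(1)                        # Lagrange multipliers lambda
x = np.zeros(3)

def aug_lagrangian(x, lam):
    r = A @ x - b                        # residual of the relaxed constraint
    return gamma @ x + lam @ r + 0.5 * mu * (r @ r)

for _ in range(50):
    # x-update: minimize the augmented Lagrangian over x >= 0
    # (a single block, so the "for all i" loop collapses to one step)
    x = minimize(aug_lagrangian, x, args=(lam,), bounds=[(0, None)] * 3).x
    # dual update with step size mu
    lam = lam + mu * (A @ x - b)

print(np.round(x, 3))                    # approx. (0, 0, 1), as with linprog
\end{verbatim}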
%
In subsequent chapters, the decoding problem will be reformulated as an
optimization problem using two different methodologies.
In chapter \ref{chapter:proximal_decoding}, a non-convex optimization approach
is chosen and addressed using the proximal gradient method.
In chapter \ref{chapter:lp_dec_using_admm}, an \ac{LP} based optimization problem is
formulated and solved using \ac{ADMM}.