
\chapter{Theoretical Background}%
\label{chapter:theoretical_background}

This chapter provides the theoretical background necessary to understand this
work.
First, the notation is clarified.
Then, the physical layer is described, namely the modulation scheme and the channel model.
A short introduction to channel coding with binary linear codes, and
\ac{LDPC} codes in particular, is given.
The established methods of decoding \ac{LDPC} codes are briefly explained.
Lastly, the optimization methods utilized are described.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General Remarks on Notation}
\label{sec:theo:Notation}

Wherever the domain of a variable is expanded, this will be indicated with a tilde.
For example:%
%
\begin{align*}
    x \in \left\{ -1, 1 \right\} &\to \tilde{x} \in \mathbb{R}\\
    c \in \mathbb{F}_2           &\to \tilde{c} \in \left[ 0, 1 \right]
.\end{align*}
%
Additionally, a shorthand notation will be used to denote series of indices and series
of indexed variables:%
%
\begin{align*}
    \left[ m:n \right]      &:= \left\{ m, m+1, \ldots, n-1, n \right\} \\
    x_{\left[ m:n \right] } &:= \left\{ x_m, x_{m+1}, \ldots, x_{n-1}, x_n \right\}
.\end{align*}
%
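As a brief illustration of this shorthand, with indices chosen arbitrarily:%
%
\begin{align*}
    \left[ 2:5 \right]      &= \left\{ 2, 3, 4, 5 \right\} \\
    x_{\left[ 1:3 \right] } &= \left\{ x_1, x_2, x_3 \right\}
.\end{align*}
%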


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries: Channel Model and Modulation}
\label{sec:theo:Preliminaries: Channel Model and Modulation}

In order to transmit a bit-word $\boldsymbol{c}$ of length $n$ over a channel,
it has to be mapped onto a symbol $\boldsymbol{x}$ that can be physically
transmitted.
This is known as modulation. The modulation scheme chosen here is \ac{BPSK}:%
%
\begin{align*}
    \boldsymbol{x} = \left( -1 \right)^{\boldsymbol{c}}
.\end{align*}
%
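As an illustrative example, with the bit-word chosen arbitrarily and the
exponentiation understood componentwise, the mapping yields%
%
\begin{align*}
    \boldsymbol{c} = \begin{pmatrix} 0 & 1 & 1 & 0 \end{pmatrix}
    \hspace{5mm} \to \hspace{5mm}
    \boldsymbol{x}
        = \begin{pmatrix} \left( -1 \right)^0 & \left( -1 \right)^1
            & \left( -1 \right)^1 & \left( -1 \right)^0 \end{pmatrix}
        = \begin{pmatrix} 1 & -1 & -1 & 1 \end{pmatrix}
.\end{align*}
%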
The symbol that reaches the receiver, $\boldsymbol{y}$, is distorted by the channel.
This distortion is described by the channel model, which here is chosen to be \ac{AWGN}:%
%
\begin{align*}
    \boldsymbol{y} = \boldsymbol{x} + \boldsymbol{z},
        \hspace{5mm} z_i \sim \mathcal{N}\left( 0, \frac{\sigma^2}{2} \right),
            \hspace{2mm} i \in \left[ 1:n \right]
.\end{align*}
%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Channel Coding with LDPC Codes}
\label{sec:theo:Channel Coding with LDPC Codes}

Channel coding describes the process of adding redundancy to information
transmitted over a channel in order to detect and correct any errors
that may occur during the transmission.
Encoding the information using \textit{binary linear codes} is one way of
conducting this process, whereby \textit{data words} are mapped onto longer
\textit{codewords}, which carry redundant information.
\Ac{LDPC} codes have become especially popular, since they are able to
reach arbitrarily small probabilities of error at code rates up to the capacity
of the channel \cite[Sec. II.B.]{mackay_rediscovery} and their structure allows
for very efficient decoding.

The lengths of the data words and codewords are denoted by $k$ and $n$,
respectively.
The set of codewords $\mathcal{C} \subset \mathbb{F}_2^n$ of a binary
linear code can be represented using the \textit{parity-check matrix}
$\boldsymbol{H} \in \mathbb{F}_2^{m\times n}$, where $m$ represents
the number of parity-checks:%
%
\begin{align*}
    \mathcal{C} := \left\{ \boldsymbol{c} \in \mathbb{F}_2^n :
            \boldsymbol{H}\boldsymbol{c}^\text{T} = \boldsymbol{0} \right\}
.\end{align*}
%
A data word $\boldsymbol{u} \in \mathbb{F}_2^k$ can be mapped onto a codeword
$\boldsymbol{c} \in \mathbb{F}_2^n$ using the \textit{generator matrix}
$\boldsymbol{G} \in \mathbb{F}_2^{k\times n}$:%
%
\begin{align*}
    \boldsymbol{c} = \boldsymbol{u}\boldsymbol{G}
.\end{align*}
%
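To illustrate both descriptions, a single parity-check code with $k = 2$ and
$n = 3$ may be considered.
This toy code is an assumption made purely for illustration and is not otherwise
used in this work.
One possible choice of generator and parity-check matrix is%
%
\begin{align*}
    \boldsymbol{G} = \begin{bmatrix}
                        1 & 0 & 1 \\
                        0 & 1 & 1
                     \end{bmatrix},
    \hspace{5mm}
    \boldsymbol{H} = \begin{bmatrix}
                        1 & 1 & 1
                     \end{bmatrix}
.\end{align*}
%
For the data word $\boldsymbol{u} = \begin{pmatrix} 1 & 0 \end{pmatrix}$, with all
operations carried out in $\mathbb{F}_2$, this gives%
%
\begin{align*}
    \boldsymbol{c} = \boldsymbol{u}\boldsymbol{G}
        = \begin{pmatrix} 1 & 0 & 1 \end{pmatrix},
    \hspace{5mm}
    \boldsymbol{H}\boldsymbol{c}^\text{T} = 1 + 0 + 1 = 0
,\end{align*}
%
so the encoded word indeed satisfies the parity-check.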

After obtaining a codeword from a data word, it is transmitted over a channel
as described in section \ref{sec:theo:Preliminaries: Channel Model and Modulation}.
The received signal $\boldsymbol{y}$ is then decoded to obtain
an estimate of the transmitted codeword, $\hat{\boldsymbol{c}}$.
Finally, the encoding procedure is reversed and an estimate of the originally
sent data word, $\hat{\boldsymbol{u}}$, is obtained.
The methods examined in this work are all based on \textit{soft-decision} decoding,
i.e., $\boldsymbol{y}$ is considered to be in $\mathbb{R}^n$ and no preliminary decision
is made by a demodulator.
The process of transmitting and decoding a codeword is visualized in
figure \ref{fig:theo:channel_overview}.%
%
\begin{figure}[H]
    \centering

    \tikzstyle{box} = [rectangle, minimum width=1.5cm, minimum height=0.7cm,
                              rounded corners=0.1cm, text centered, draw=black, fill=KITgreen!80]

    \begin{tikzpicture}[scale=1, transform shape]
        \node (c) {$\boldsymbol{c}$};
        \node[box, right=0.5cm of c] (bpskmap) {Mapper};
        \node[right=1.5cm of bpskmap,
              draw, circle, inner sep=0pt, minimum size=0.5cm] (add) {$+$};
        \node[box, right=1.5cm of add] (decoder) {Decoder};
        \node[box, right=1.5cm of decoder] (demapper) {Demapper};
        \node[right=0.5cm of demapper] (out) {$\boldsymbol{\hat{c}}$};

        \node (x) at ($(bpskmap.east)!0.5!(add.west) + (0,0.3cm)$) {$\boldsymbol{x}$};
        \node (y) at ($(add.east)!0.5!(decoder.west) + (0,0.3cm)$) {$\boldsymbol{y}$};
        \node (x_hat) at ($(decoder.east)!0.5!(demapper.west) + (0,0.3cm)$)
            {$\boldsymbol{\hat{x}}$};
        \node[below=0.5cm of add] (z) {$\boldsymbol{z}$};

        \draw[->] (c) -- (bpskmap);
        \draw[->] (bpskmap) -- (add);
        \draw[->] (add) -- (decoder);
        \draw[->] (z) -- (add);
        \draw[->] (decoder) -- (demapper);
        \draw[->] (demapper) -- (out);

        \coordinate (top_left)      at ($(x.north west) + (-0.1cm, 0.1cm)$);
        \coordinate (top_right)     at ($(y.north east) + (+0.1cm, 0.1cm)$);
        \coordinate (bottom_center) at ($(z.south)      + (0cm,   -0.1cm)$);
        \draw[dashed] (top_left) -- (top_right) |- (bottom_center) -| cycle;
        \node[below=0.25cm of z] (text) {Channel};
    \end{tikzpicture}

    \caption{Overview of channel model and modulation}
    \label{fig:theo:channel_overview}
\end{figure}

\todo{$\boldsymbol{z}$ is used to denote both the noise and the auxiliary variable for ADMM}
\todo{Mapper $\to$ Modulator?}

The decoding process itself is generally based either on the \ac{MAP} or the \ac{ML}
criterion:%
%
\begin{align*}
    \hat{\boldsymbol{c}}_{\text{\ac{MAP}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
    p_{\boldsymbol{C} \mid \boldsymbol{Y}} \left(\boldsymbol{c} \mid \boldsymbol{y}
        \right) \\
    \hat{\boldsymbol{c}}_{\text{\ac{ML}}} &= \argmax_{\boldsymbol{c} \in \mathcal{C}}
    f_{\boldsymbol{Y} \mid \boldsymbol{C}} \left( \boldsymbol{y} \mid \boldsymbol{c}
        \right)
.\end{align*}%
%
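The two criteria are closely related.
By Bayes' theorem,%
%
\begin{align*}
    p_{\boldsymbol{C} \mid \boldsymbol{Y}} \left( \boldsymbol{c} \mid \boldsymbol{y} \right)
        = \frac{f_{\boldsymbol{Y} \mid \boldsymbol{C}} \left( \boldsymbol{y} \mid \boldsymbol{c}
            \right) p_{\boldsymbol{C}} \left( \boldsymbol{c} \right)}
            {f_{\boldsymbol{Y}} \left( \boldsymbol{y} \right)}
,\end{align*}
%
so if all codewords are assumed to be transmitted with equal probability
(an assumption that is not stated above), the \ac{MAP} and \ac{ML} estimates coincide.
As a sketch for the \ac{BPSK} and \ac{AWGN} model from section
\ref{sec:theo:Preliminaries: Channel Model and Modulation}, let
$\boldsymbol{x}\left( \boldsymbol{c} \right) = \left( -1 \right)^{\boldsymbol{c}}$
denote the symbol belonging to $\boldsymbol{c}$ (a notation introduced only for
this example).
The likelihood then factorizes as%
%
\begin{align*}
    f_{\boldsymbol{Y} \mid \boldsymbol{C}} \left( \boldsymbol{y} \mid \boldsymbol{c} \right)
        = \prod_{i=1}^{n} \frac{1}{\sqrt{\pi\sigma^2}}
            \exp\left( -\frac{\left( y_i - x_i\left( \boldsymbol{c} \right) \right)^2}
            {\sigma^2} \right)
,\end{align*}
%
and the \ac{ML} criterion can be rewritten as%
%
\begin{align*}
    \hat{\boldsymbol{c}}_{\text{\ac{ML}}}
        = \argmin_{\boldsymbol{c} \in \mathcal{C}}
            \lVert \boldsymbol{y} - \boldsymbol{x}\left( \boldsymbol{c} \right) \rVert_2^2
        = \argmax_{\boldsymbol{c} \in \mathcal{C}}
            \sum_{i=1}^{n} y_i x_i\left( \boldsymbol{c} \right)
,\end{align*}
%
where the last step uses the fact that
$\lVert \boldsymbol{x}\left( \boldsymbol{c} \right) \rVert_2^2 = n$
for every $\boldsymbol{c}$.
%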


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Tanner Graphs and Belief Propagation}
\label{sec:theo:Tanner Graphs and Belief Propagation}

It is often helpful to visualize codes graphically.
This is especially true for \ac{LDPC} codes, as the established decoding
algorithms are \textit{message passing algorithms}, which are inherently
graph-based.

Binary linear codes with a parity-check matrix $\boldsymbol{H}$ can be
visualized using a \textit{Tanner} or \textit{factor graph}:
Each row of $\boldsymbol{H}$, which represents one parity-check, is viewed as a
\ac{CN}.
Each component of the codeword $\boldsymbol{c}$ is interpreted as a \ac{VN}.
The relationship between \acp{CN} and \acp{VN} can then be plotted by noting
which components of $\boldsymbol{c}$ are considered for which parity-check.
Figure \ref{fig:theo:tanner_graph} shows the Tanner graph for the
(7,4) Hamming code, which has the following parity-check matrix
\cite[Example 5.7.]{ryan_lin_2009}:%
%
\begin{align*}
    \boldsymbol{H} = \begin{bmatrix}
                        1 & 0 & 1 & 0 & 1 & 0 & 1 \\
                        0 & 1 & 1 & 0 & 0 & 1 & 1 \\
                        0 & 0 & 0 & 1 & 1 & 1 & 1
                     \end{bmatrix}
.\end{align*}
%
%
\begin{figure}[H]
    \centering

    \tikzstyle{checknode} = [color=KITblue, fill=KITblue,
                            draw, regular polygon,regular polygon sides=4,
                            inner sep=0pt, minimum size=12pt]
    \tikzstyle{variablenode} = [color=KITgreen, fill=KITgreen,
                            draw, circle, inner sep=0pt, minimum size=10pt]

    \begin{tikzpicture}[scale=1, transform shape]
        \node[checknode,
              label={[below, label distance=-0.4cm, align=center]
              \acs{CN} 1\\$\left( c_1 + c_3 + c_5 + c_7 = 0 \right) $}]
            (cn1) at (-4, -1) {};
        \node[checknode,
              label={[below, label distance=-0.4cm, align=center]
              \acs{CN} 2\\$\left( c_2 + c_3 + c_6 + c_7 = 0 \right) $}]
            (cn2) at (0, -1) {};
        \node[checknode,
              label={[below, label distance=-0.4cm, align=center]
              \acs{CN} 3\\$\left( c_4 + c_5 + c_6 + c_7 = 0 \right) $}]
            (cn3) at (4, -1) {};
        \node[variablenode, label={[above, align=center] \acs{VN} 1\\$c_1$}] (c1) at (-4.5, 2) {};
        \node[variablenode, label={[above, align=center] \acs{VN} 2\\$c_2$}] (c2) at (-3, 2)   {};
        \node[variablenode, label={[above, align=center] \acs{VN} 3\\$c_3$}] (c3) at (-1.5, 2) {};
        \node[variablenode, label={[above, align=center] \acs{VN} 4\\$c_4$}] (c4) at (0, 2)    {};
        \node[variablenode, label={[above, align=center] \acs{VN} 5\\$c_5$}] (c5) at (1.5, 2)  {};
        \node[variablenode, label={[above, align=center] \acs{VN} 6\\$c_6$}] (c6) at (3, 2)    {};
        \node[variablenode, label={[above, align=center] \acs{VN} 7\\$c_7$}] (c7) at (4.5, 2)  {};

        \draw (cn1) -- (c1);
        \draw (cn1) -- (c3);
        \draw (cn1) -- (c5);
        \draw (cn1) -- (c7);

        \draw (cn2) -- (c2);
        \draw (cn2) -- (c3);
        \draw (cn2) -- (c6);
        \draw (cn2) -- (c7);

        \draw (cn3) -- (c4);
        \draw (cn3) -- (c5);
        \draw (cn3) -- (c6);
        \draw (cn3) -- (c7);
    \end{tikzpicture}

    \caption{Tanner graph for the (7,4) Hamming code}
    \label{fig:theo:tanner_graph}
\end{figure}%
%
\noindent \acp{CN} and \acp{VN}, and by extension the rows and columns of
$\boldsymbol{H}$, are indexed with the variables $j$ and $i$.
The sets of all \acp{CN} and all \acp{VN} are denoted by
$\mathcal{J} := \left[ 1:m \right]$ and $\mathcal{I} := \left[ 1:n \right]$, respectively.
The \textit{neighbourhood} of the $j$th \ac{CN}, i.e., the set of all adjacent \acp{VN},
is denoted by $N_c\left( j \right)$.
The neighbourhood of the $i$th \ac{VN} is denoted by $N_v\left( i \right)$.
For the code depicted in figure \ref{fig:theo:tanner_graph}, for example,
$N_c\left( 1 \right) = \left\{ 1, 3, 5, 7 \right\}$ and
$N_v\left( 3 \right) = \left\{ 1, 2 \right\}$.
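The parity checks attached to the \acp{CN} can be verified for a concrete word.
The word
$\boldsymbol{c} = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}$,
chosen here purely for illustration, satisfies all three checks in $\mathbb{F}_2$:%
%
\begin{align*}
    \text{\acs{CN} 1:} \hspace{5mm} c_1 + c_3 + c_5 + c_7 &= 1 + 1 + 0 + 0 = 0 \\
    \text{\acs{CN} 2:} \hspace{5mm} c_2 + c_3 + c_6 + c_7 &= 1 + 1 + 0 + 0 = 0 \\
    \text{\acs{CN} 3:} \hspace{5mm} c_4 + c_5 + c_6 + c_7 &= 0 + 0 + 0 + 0 = 0
.\end{align*}
%
It is therefore a codeword.
Flipping $c_3$ yields
$\begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$, which violates
\acs{CN} 1 and \acs{CN} 2, i.e., exactly the \acp{CN} in $N_v\left( 3 \right)$,
while \acs{CN} 3 remains satisfied.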

\todo{Define $d_i$ and $d_j$}

Message passing algorithms are based on the notion of passing messages between
\acp{CN} and \acp{VN}.
\Ac{BP} is one such algorithm that is commonly used to decode \ac{LDPC} codes.
It aims to compute the posterior probabilities
$p_{C_i \mid \boldsymbol{Y}}\left(c_i = 1 \mid \boldsymbol{y} \right),\hspace{2mm} i\in\mathcal{I}$
\cite[Sec. III.]{mackay_rediscovery} and use them to calculate the estimate $\hat{\boldsymbol{c}}$.
For cycle-free graphs this goal is reached after a finite
number of steps and \ac{BP} is thus equivalent to \ac{MAP} decoding.
When the graph contains cycles, however, \ac{BP} only approximates the probabilities
and is sub-optimal.
This leads to generally worse performance than \ac{MAP} decoding for practical codes.
Additionally, an \textit{error floor} appears for very high \acp{SNR}, making
the use of \ac{BP} impractical for applications where a very low \ac{BER} is
desired \cite[Sec. 15.3]{ryan_lin_2009}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Optimization Methods}
\label{sec:theo:Optimization Methods}

\textit{Proximal algorithms} are algorithms for solving convex optimization
problems that rely on the use of \textit{proximal operators}.
The proximal operator $\textbf{prox}_{\lambda f} : \mathbb{R}^n \rightarrow \mathbb{R}^n$
of a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ with parameter $\lambda > 0$
is defined by \cite[Sec. 1.1]{proximal_algorithms}%
%
\begin{align*}
    \textbf{prox}_{\lambda f}\left( \boldsymbol{v} \right) = \argmin_{\boldsymbol{x}} \left(
        f\left( \boldsymbol{x} \right) + \frac{1}{2\lambda}\lVert \boldsymbol{x}
            - \boldsymbol{v} \rVert_2^2 \right)
.\end{align*}
%
This operator computes a point that is a compromise between minimizing $f$
and staying in the proximity of $\boldsymbol{v}$.
The parameter $\lambda$ determines how heavily each term is weighed.
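As a simple illustration, which is not taken from the cited source, consider
$f\left( \boldsymbol{x} \right) = \frac{1}{2}\lVert \boldsymbol{x} \rVert_2^2$.
Setting the gradient of the expression to be minimized to zero gives%
%
\begin{align*}
    \boldsymbol{x} + \frac{1}{\lambda}\left( \boldsymbol{x} - \boldsymbol{v} \right)
        = \boldsymbol{0}
    \hspace{5mm} \Rightarrow \hspace{5mm}
    \textbf{prox}_{\lambda f}\left( \boldsymbol{v} \right)
        = \frac{1}{1 + \lambda} \boldsymbol{v}
.\end{align*}
%
For $\lambda \to 0$ the result approaches $\boldsymbol{v}$, whereas for
$\lambda \to \infty$ it approaches $\boldsymbol{0}$, the minimizer of $f$.
%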
The \textit{proximal gradient method} is an iterative optimization method used to
solve problems of the form%
%
\begin{align*}
    \text{minimize}\hspace{5mm}f\left( \boldsymbol{x} \right) + g\left( \boldsymbol{x} \right)
\end{align*}
%
that consists of two steps: minimizing $f$ with gradient descent
and minimizing $g$ using the proximal operator
\cite[Sec. 4.2]{proximal_algorithms}:%
%
\begin{align*}
    \boldsymbol{x} &\leftarrow \boldsymbol{x} - \lambda \nabla f\left( \boldsymbol{x} \right) \\
    \boldsymbol{x} &\leftarrow \textbf{prox}_{\lambda g} \left( \boldsymbol{x} \right)
.\end{align*}
%
Since $g$ is minimized with the proximal operator and is thus not required
to be differentiable, it can be used to encode the constraints of the problem.
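A common instance of this, given here as an illustration rather than as part of
the methods used later, is the indicator function of a closed convex set
$\mathcal{S} \subseteq \mathbb{R}^n$, for which $g$ is allowed to take the
value $+\infty$:%
%
\begin{align*}
    g\left( \boldsymbol{x} \right) =
        \begin{cases}
            0,       & \boldsymbol{x} \in \mathcal{S} \\
            +\infty, & \text{otherwise}
        \end{cases}
    \hspace{5mm} \Rightarrow \hspace{5mm}
    \textbf{prox}_{\lambda g}\left( \boldsymbol{v} \right)
        = \argmin_{\boldsymbol{x} \in \mathcal{S}}
            \lVert \boldsymbol{x} - \boldsymbol{v} \rVert_2^2
.\end{align*}
%
The proximal operator is then the Euclidean projection onto $\mathcal{S}$,
independently of $\lambda$.
For the box $\mathcal{S} = \left[ 0, 1 \right]^n$, for instance, it reduces to
componentwise clipping,
$\left( \textbf{prox}_{\lambda g}\left( \boldsymbol{v} \right) \right)_i
= \min\left( \max\left( v_i, 0 \right), 1 \right)$,
and the proximal gradient method becomes projected gradient descent.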

A special case of convex optimization problems are \textit{linear programs}.
These are problems where the objective function is linear and the constraints
consist of linear equalities and inequalities.
Generally, any linear program can be expressed in \textit{standard form}%
\footnote{The inequality $\boldsymbol{x} \ge \boldsymbol{0}$ is to be
interpreted componentwise.}
\cite[Sec. 1.1]{intro_to_lin_opt_book}:%
%
\begin{alignat}{3}
    \begin{alignedat}{3}
        \text{minimize }\hspace{2mm}   && \boldsymbol{\gamma}^\text{T} \boldsymbol{x}         \\
        \text{subject to }\hspace{2mm} && \boldsymbol{A}\boldsymbol{x}   & = \boldsymbol{b}   \\
                                       &&               \boldsymbol{x}   & \ge \boldsymbol{0}.
    \end{alignedat}
    \label{eq:theo:admm_standard}
\end{alignat}%
%
A technique called \textit{Lagrangian relaxation} \cite[Sec. 11.4]{intro_to_lin_opt_book}
can then be applied.
First, some of the constraints are moved into the objective function itself
and the weights $\boldsymbol{\lambda}$ are introduced. A new, relaxed problem
is formulated:
%
\begin{align}
    \begin{aligned}
        \text{minimize }\hspace{2mm}   & \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
            + \boldsymbol{\lambda}^\text{T}\left(\boldsymbol{b}
                - \boldsymbol{A}\boldsymbol{x} \right)  \\
        \text{subject to }\hspace{2mm} & \boldsymbol{x} \ge \boldsymbol{0},
    \end{aligned}
    \label{eq:theo:admm_relaxed}
\end{align}%
%
the new objective function being the \textit{Lagrangian}%
%
\begin{align*}
\mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
    = \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
        + \boldsymbol{\lambda}^\text{T}\left(\boldsymbol{b}
            - \boldsymbol{A}\boldsymbol{x} \right)
.\end{align*}%
%
This problem is not directly equivalent to the original one, as the
solution now depends on the choice of the \textit{Lagrange multipliers}
$\boldsymbol{\lambda}$.
Interestingly, however, for this particular class of problems,
the minimum of the objective function (hereafter called \textit{optimal objective})
of the relaxed problem (\ref{eq:theo:admm_relaxed}) is a lower bound for
the optimal objective of the original problem (\ref{eq:theo:admm_standard})
\cite[Sec. 4.1]{intro_to_lin_opt_book}:%
%
\begin{align*}
    \min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \phantom{a}}}
        \mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda}
        \right)
    \le
    \min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \boldsymbol{A}\boldsymbol{x}
            = \boldsymbol{b}}}
        \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
.\end{align*}
%
Furthermore, for uniquely solvable linear programs \textit{strong duality}
always holds \cite[Theorem 4.4]{intro_to_lin_opt_book}.
This means that the tightest lower bound is not merely a bound, but
actually attains the optimal objective of the original problem.
In other words, with the optimal choice of $\boldsymbol{\lambda}$,
the optimal objectives of the problems (\ref{eq:theo:admm_relaxed})
and (\ref{eq:theo:admm_standard}) have the same value:%
%
\begin{align*}
    \max_{\boldsymbol{\lambda}} \, \min_{\boldsymbol{x} \ge \boldsymbol{0}}
        \mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
    = \min_{\substack{\boldsymbol{x} \ge \boldsymbol{0} \\ \boldsymbol{A}\boldsymbol{x}
            = \boldsymbol{b}}}
        \boldsymbol{\gamma}^\text{T}\boldsymbol{x}
.\end{align*}
%
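These relations can be checked on a small linear program, constructed here
purely for illustration:%
%
\begin{align*}
    \text{minimize }\hspace{2mm}   & x_1 + 2 x_2 \\
    \text{subject to }\hspace{2mm} & x_1 + x_2 = 1, \hspace{5mm} x_1, x_2 \ge 0
.\end{align*}
%
On the feasible set the objective equals $2 - x_1$, so the optimal solution is
$x_1 = 1, x_2 = 0$ with an optimal objective of $1$.
With a single multiplier $\lambda$, the Lagrangian and its minimum over
$\boldsymbol{x} \ge \boldsymbol{0}$ are%
%
\begin{align*}
    \mathcal{L}\left( \boldsymbol{x}, \lambda \right)
        &= x_1 + 2 x_2 + \lambda\left( 1 - x_1 - x_2 \right)
         = \lambda + \left( 1 - \lambda \right) x_1 + \left( 2 - \lambda \right) x_2 \\
    \min_{\boldsymbol{x} \ge \boldsymbol{0}}
        \mathcal{L}\left( \boldsymbol{x}, \lambda \right)
        &= \begin{cases}
            \lambda, & \lambda \le 1 \\
            -\infty, & \lambda > 1
        \end{cases}
.\end{align*}
%
Every choice of $\lambda$ thus yields a lower bound on the optimal objective,
and the tightest bound, obtained for $\lambda = 1$, attains the value $1$.
%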
Thus, we can define the \textit{dual problem} as the search for the tightest lower bound:%
%
\begin{align}
    \text{maximize }\hspace{2mm} & \min_{\boldsymbol{x} \ge \boldsymbol{0}} \mathcal{L}
        \left( \boldsymbol{x}, \boldsymbol{\lambda} \right)
    \label{eq:theo:dual}
,\end{align}
%
and recover the solution $\boldsymbol{x}_{\text{opt}}$ to problem (\ref{eq:theo:admm_standard})
from the solution $\boldsymbol{\lambda}_\text{opt}$ to problem (\ref{eq:theo:dual})
by computing \cite[Sec. 2.1]{admm_distr_stats}%
%
\begin{align}
    \boldsymbol{x}_{\text{opt}} = \argmin_{\boldsymbol{x}}
        \mathcal{L}\left( \boldsymbol{x}, \boldsymbol{\lambda}_{\text{opt}} \right)
    \label{eq:theo:admm_obtain_primal}
.\end{align}
%

The dual problem can then be solved iteratively using \textit{dual ascent}: starting with an
initial estimate for $\boldsymbol{\lambda}$, calculate an estimate for $\boldsymbol{x}$
using equation (\ref{eq:theo:admm_obtain_primal}); then, update $\boldsymbol{\lambda}$
using gradient ascent on the dual objective, whose gradient with respect to
$\boldsymbol{\lambda}$ is $\boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}$
\cite[Sec. 2.1]{admm_distr_stats}:%
%
\begin{align*}
    \boldsymbol{x} &\leftarrow \argmin_{\boldsymbol{x}} \mathcal{L}\left(
        \boldsymbol{x}, \boldsymbol{\lambda} \right) \\
    \boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
        + \alpha\left( \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x} \right),
    \hspace{5mm} \alpha > 0
.\end{align*}
%
The algorithm can be improved by observing that when the objective function is
separable in $\boldsymbol{x}$, the Lagrangian is as well:
%
\begin{align*}
    \text{minimize }\hspace{5mm} & \sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right)  \\
    \text{subject to}\hspace{5mm} & \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
        = \boldsymbol{b}
\end{align*}
\begin{align*}
    \mathcal{L}\left( \boldsymbol{x}_{[1:N]}, \boldsymbol{\lambda} \right)
        = \sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right)
            + \boldsymbol{\lambda}^\text{T} \left( \boldsymbol{b}
            - \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i \right)
.\end{align*}%
%
The minimization of each term can then happen in parallel, in a distributed fashion
\cite[Sec. 2.2]{admm_distr_stats}.
This modified version of dual ascent is called \textit{dual decomposition}:
%
\begin{align*}
    \boldsymbol{x}_i &\leftarrow \argmin_{\boldsymbol{x}_i}\mathcal{L}\left(
        \boldsymbol{x}_{[1:N]}, \boldsymbol{\lambda}\right)
        \hspace{5mm} \forall i \in [1:N]\\
    \boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
        + \alpha\left( \boldsymbol{b}
            - \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i \right),
        \hspace{5mm} \alpha > 0
.\end{align*}
%
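Written out, the minimization over a single block $\boldsymbol{x}_i$ only
involves those terms of the Lagrangian that depend on $\boldsymbol{x}_i$; all
remaining terms are constant with respect to $\boldsymbol{x}_i$ and can be dropped.
Under the definitions above, each update therefore amounts to the smaller problem%
%
\begin{align*}
    \boldsymbol{x}_i \leftarrow \argmin_{\boldsymbol{x}_i} \left(
        g_i\left( \boldsymbol{x}_i \right)
        - \boldsymbol{\lambda}^\text{T} \boldsymbol{A}_i\boldsymbol{x}_i \right)
,\end{align*}
%
which requires only $g_i$, $\boldsymbol{A}_i$ and the current value of
$\boldsymbol{\lambda}$.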

The \ac{ADMM} works the same way as dual decomposition.
It only differs in the use of an \textit{augmented Lagrangian}
$\mathcal{L}_\mu\left( \boldsymbol{x}_{[1:N]}, \boldsymbol{\lambda} \right)$
in order to make the convergence more robust.
The augmented Lagrangian extends the ordinary one with an additional penalty term,
weighted by the penalty parameter $\mu$:
%
\begin{align*}
    \mathcal{L}_\mu \left( \boldsymbol{x}_{[1:N]}, \boldsymbol{\lambda} \right)
        = \underbrace{\sum_{i=1}^{N} g_i\left( \boldsymbol{x}_i \right)
            + \boldsymbol{\lambda}^\text{T}\left( \boldsymbol{b}
        - \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i \right)}_{\text{Ordinary Lagrangian}}
            + \underbrace{\frac{\mu}{2}\left\lVert \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i
            - \boldsymbol{b} \right\rVert_2^2}_{\text{Penalty term}},
        \hspace{5mm} \mu > 0
.\end{align*}
%
The steps to solve the problem are the same as with dual decomposition, with the added
condition that the step size be $\mu$:%
%
\begin{align*}
    \boldsymbol{x}_i &\leftarrow \argmin_{\boldsymbol{x}_i}\mathcal{L}_\mu\left(
        \boldsymbol{x}_{[1:N]}, \boldsymbol{\lambda}\right)
        \hspace{5mm} \forall i \in [1:N]\\
    \boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
        + \mu\left( \boldsymbol{b}
            - \sum_{i=1}^{N} \boldsymbol{A}_i\boldsymbol{x}_i \right),
        \hspace{5mm} \mu > 0
%    \boldsymbol{x}_1 &\leftarrow \argmin_{\boldsymbol{x}_1}\mathcal{L}_\mu\left(
%        \boldsymbol{x}_1, \boldsymbol{x_2}, \boldsymbol{\lambda}\right) \\
%    \boldsymbol{x}_2 &\leftarrow \argmin_{\boldsymbol{x}_2}\mathcal{L}_\mu\left(
%        \boldsymbol{x}_1, \boldsymbol{x_2}, \boldsymbol{\lambda}\right) \\
%    \boldsymbol{\lambda} &\leftarrow \boldsymbol{\lambda}
%        + \mu\left( \boldsymbol{A}_1\boldsymbol{x}_1 + \boldsymbol{A}_2\boldsymbol{x}_2
%            - \boldsymbol{b} \right),
%        \hspace{5mm} \mu > 0
.\end{align*}
%