Reworked proximal decoding up to and including the choice of parameters

Andreas Tsouchlos 2023-04-11 18:40:15 +02:00
parent 46ebd5aedc
commit 5c135e085e


@ -17,7 +17,8 @@ Proximal decoding was proposed by Wadayama et al. as a novel formulation of
optimization-based decoding \cite{proximal_paper}.
With this algorithm, minimization is performed using the proximal gradient
method.
In contrast to \ac{LP} decoding, which will be covered in chapter
\ref{chapter:lp_dec_using_admm}, the objective function is based on a
non-convex optimization formulation of the \ac{MAP} decoding problem.
In order to derive the objective function, the authors begin with the
@ -121,8 +122,9 @@ and the decoding problem is reformulated to%
.\end{align*}
%
For the solution of the approximate \ac{MAP} decoding problem using the
proximal gradient method, the two parts of equation
(\ref{eq:prox:objective_function}) are considered separately:
the minimization of the objective function occurs in an alternating
fashion, switching between the negative log-likelihood
$L\left( \boldsymbol{y} \mid \boldsymbol{x} \right) $ and the scaled
@ -140,10 +142,8 @@ descent:%
.\end{align}%
%
For the second step, minimizing the scaled code-constraint polynomial, the
\textit{proximal operator} of $\gamma h\left( \tilde{\boldsymbol{x}} \right) $
has to be computed.
It is then immediately approximated with gradient-descent:%
%
\begin{align*}
@ -258,7 +258,7 @@ It was subsequently reimplemented in C++ using the Eigen%
linear algebra library to achieve higher performance.
The focus has been set on a fast implementation, sometimes at the expense of
memory usage, somewhat limiting the size of the codes the implementation can be
used with.
The evaluation of the simulation results has been wholly realized in Python.
The gradient of the code-constraint polynomial \cite[Sec. 2.3]{proximal_paper}
@ -309,8 +309,6 @@ matrix-vector multiplication.
This is beneficial, as the libraries employed for the implementation are
heavily optimized for such calculations (e.g., through vectorization of the
operations).
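To make the structure of one decoder iteration concrete, the following is a
minimal NumPy sketch, not the thesis implementation: the gradient of the
code-constraint polynomial is passed in as a placeholder callable (its closed
form from Sec. 2.3 of the proximal decoding paper is not reproduced here), an
AWGN channel with BPSK mapping is assumed so that the gradient of the negative
log-likelihood is proportional to $\tilde{\boldsymbol{x}} - \boldsymbol{y}$,
and the projection interval described below is assumed to be $[-\eta, \eta]$.

```python
import numpy as np

def proximal_decode(y, H, grad_h, omega=0.05, gamma=0.05, eta=1.5, K=200):
    """Sketch of one possible proximal-decoding loop (not the thesis code).

    y      : received vector (AWGN channel, BPSK mapping 0 -> +1, 1 -> -1 assumed)
    H      : binary parity-check matrix as a NumPy array
    grad_h : callable returning the gradient of the code-constraint
             polynomial h at a point (placeholder for the expression in
             Sec. 2.3 of the proximal decoding paper)
    omega  : step size of the negative log-likelihood step
    gamma  : step size of the code-constraint polynomial step
    eta    : clipping bound of the projection (interval [-eta, eta] assumed)
    K      : maximum number of iterations
    """
    x = np.asarray(y, dtype=float).copy()   # start from the channel output
    for _ in range(K):
        # Step 1: gradient step on the negative log-likelihood; for an AWGN
        # channel the gradient is proportional to (x - y), with the
        # noise-variance factor absorbed into omega here.
        x = x - omega * (x - y)

        # Step 2: gradient step approximating the proximal operator
        # of gamma * h(x).
        x = x - gamma * grad_h(x)

        # Projection: clip each component onto the assumed interval.
        x = np.clip(x, -eta, eta)

        # Hard decision and early termination once a valid codeword is found.
        c = (x < 0).astype(int)
        if not np.any(H @ c % 2):
            break
    return c
```

Written this way, the likelihood step and the projection act componentwise,
while the code-constraint step reduces to matrix-vector products, which is the
structure the Eigen-based implementation exploits.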
The projection $\prod_{\eta}\left( . \right)$ also proves straightforward to
compute, as it amounts to simply clipping each component of the vector onto
@ -332,15 +330,13 @@ The convergence properties are reviewed and related to the decoding
performance.
Finally, the computational performance is examined on a theoretical basis
as well as on the basis of the implementation completed in the context of this
thesis.
All simulation results presented hereafter are based on Monte Carlo
simulations.
The \ac{BER} and \ac{FER} curves in particular have been generated by
producing at least 100 frame errors for each data point, unless otherwise
stated.
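As an illustration of this procedure, a Monte Carlo loop of the following kind
can generate one such data point; the `encode` and `decode` callables are
hypothetical stand-ins rather than the interfaces of the implementation
described above, and the SNR is interpreted here as $E_b/N_0$.

```python
import numpy as np

def simulate_point(encode, decode, k, n, snr_db, min_frame_errors=100, seed=0):
    """Monte Carlo estimate of BER and FER for a single SNR value.

    Simulation runs until at least `min_frame_errors` frame errors have been
    observed, mirroring the stopping rule used for the curves in this section.
    `encode` and `decode` are hypothetical stand-ins for the code and decoder
    under test; the SNR is interpreted as Eb/N0 in dB.
    """
    rng = np.random.default_rng(seed)
    rate = k / n
    sigma = np.sqrt(1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0)))

    bit_errors = frame_errors = frames = 0
    while frame_errors < min_frame_errors:
        bits = encode(rng.integers(0, 2, size=k))    # random codeword
        tx = 1.0 - 2.0 * bits                        # BPSK: 0 -> +1, 1 -> -1
        rx = tx + sigma * rng.standard_normal(n)     # AWGN channel
        est = decode(rx)                             # estimated bits
        errors = np.count_nonzero(est != bits)
        bit_errors += errors
        frame_errors += errors > 0
        frames += 1
    return bit_errors / (frames * n), frame_errors / frames
```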
\subsection{Choice of Parameters}
@ -418,9 +414,9 @@ while the newly generated ones are shown with dashed lines.
\noindent It is noticeable that for a moderately chosen value of $\gamma$
($\gamma = 0.05$) the decoding performance is better than for low
($\gamma = 0.01$) or high ($\gamma = 0.15$) values.
The question arises whether there is some optimal value maximizing the decoding
performance, especially since it seems to depend dramatically on $\gamma$.
To better understand how they are
related, figure \ref{fig:prox:results} was recreated, but with a considerably
larger selection of values for $\gamma$.
In this new graph, shown in figure \ref{fig:prox:results_3d}, instead of
@ -431,11 +427,7 @@ The previously shown results are highlighted.
Evidently, while the decoding performance does depend on the value of
$\gamma$, there is no single value offering optimal performance, but
rather a certain interval in which the performance stays largely unchanged.
\begin{figure}[h]
\centering
@ -485,11 +477,15 @@ in each case.
\cite[\text{204.33.484}]{mackay_enc}; $\omega = 0.05, K=200, \eta=1.5$
}%
%
This indicates that while the choice of the parameter $\gamma$
significantly affects the decoding performance, there is little benefit
in undertaking an extensive search for an exact optimum.
Rather, a preliminary examination providing a rough window for $\gamma$ may
be sufficient.
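Such a rough window can be read off directly from a coarse sweep; the small
helper below is a sketch of this selection step, with `gammas` and `ber`
standing for the results of Monte Carlo simulations like the ones above and
the tolerance factor being an arbitrary illustrative choice.

```python
import numpy as np

def gamma_window(gammas, ber, tolerance=2.0):
    """Interval of gamma values whose estimated BER stays within
    `tolerance` times the best BER observed in a coarse sweep.

    gammas, ber : arrays produced by a coarse Monte Carlo sweep over gamma
    tolerance   : arbitrary illustrative factor defining "close to optimal"
    """
    gammas = np.asarray(gammas)
    ber = np.asarray(ber)
    acceptable = ber <= tolerance * ber.min()
    # Return the hull of the acceptable values; for a single flat region,
    # as observed here, this coincides with the usable window.
    return gammas[acceptable].min(), gammas[acceptable].max()
```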
When examining a number of different codes (figure
\ref{fig:prox:results_3d_multiple}), it is apparent that while the exact
landscape of the graph depends on the code, the general behaviour is the same
in each case.
The parameter $\gamma$ describes the step-size for the optimization step
dealing with the code-constraint polynomial;
@ -497,10 +493,12 @@ the parameter $\omega$ describes the step-size for the step dealing with the
negative log-likelihood.
The relationship between $\omega$ and $\gamma$ is portrayed in figure
\ref{fig:prox:gamma_omega}.
The color of each cell indicates the \ac{BER} when the corresponding values
are chosen for the parameters.
The \ac{SNR} is kept constant at $\SI{4}{dB}$.
The \ac{BER} exhibits similar behaviour in its dependence on $\omega$ as
on $\gamma$: it is minimized when the parameter is kept within certain
bounds, without displaying a single clear optimum.
It is noteworthy that the decoder seems to achieve the best performance for
similar values of the two step sizes.
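A figure of this kind can be generated by sweeping both step sizes over a grid
and estimating the BER for each cell at the fixed SNR, roughly as sketched
below; `estimate_ber` is a hypothetical stand-in for a Monte Carlo estimator
such as the one outlined earlier.

```python
import numpy as np

def parameter_grid(estimate_ber, gammas, omegas, snr_db=4.0):
    """Estimate the BER for every (omega, gamma) pair at a fixed SNR.

    `estimate_ber(gamma, omega, snr_db)` is a hypothetical stand-in for a
    Monte Carlo estimator of the decoder's BER.
    """
    ber = np.empty((len(omegas), len(gammas)))
    for i, omega in enumerate(omegas):
        for j, gamma in enumerate(gammas):
            ber[i, j] = estimate_ber(gamma, omega, snr_db)
    return ber

# Logarithmically spaced step sizes; the resulting matrix can be drawn as a
# heat map (e.g. with matplotlib's pcolormesh and a logarithmic colour scale),
# the colour of each cell indicating the estimated BER.
gammas = np.logspace(-2.5, -0.5, 20)
omegas = np.logspace(-2.5, -0.5, 20)
# ber = parameter_grid(estimate_ber, gammas, omegas)
```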
Again, this consideration applies to a multitude of different codes, as
@ -552,19 +550,21 @@ depicted in figure \ref{fig:prox:gamma_omega_multiple}.
To better understand how to determine the optimal value for the parameter $K$,
the average error is inspected.
This time $\gamma$ and $\omega$ are held constant at $0.05$ and the average
error is observed during each iteration of the decoding process, for a number
of different \acp{SNR}.
The plots have been generated by averaging the error over $\SI{500000}{}$
decodings.
As some decodings go on for more iterations than others, the number of values
which are averaged for each datapoint varies.
This explains the dip visible in all curves around $k=20$, since after
this point more and more correct decodings are completed,
leaving more and more faulty ones to be averaged.
Additionally, at this point the decline in the average error stagnates,
rendering an increase in $K$ counterproductive, as it only raises the average
timing requirements of the decoding process.
Another aspect to consider is that the higher the \ac{SNR}, the fewer
decodings remain at each iteration to be averaged, since a solution is found
earlier.
This explains the decreasing smoothness of the lines as the \ac{SNR} rises.
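The varying number of contributing decodings can be handled by keeping a
separate count per iteration index, for instance as in the following sketch,
where `decode_with_trace` is a hypothetical stand-in that runs one decoding
and returns its per-iteration error values.

```python
import numpy as np

def average_error_per_iteration(decode_with_trace, num_decodings, K=200):
    """Average the error at each iteration index over many decodings.

    `decode_with_trace()` is a hypothetical stand-in: it runs one decoding
    and returns one error value per iteration actually performed, so
    decodings that find a codeword early contribute fewer values.
    """
    error_sum = np.zeros(K)
    count = np.zeros(K, dtype=int)    # decodings still active at iteration k
    for _ in range(num_decodings):
        trace = np.asarray(decode_with_trace())
        k = len(trace)
        error_sum[:k] += trace
        count[:k] += 1
    # Each data point is averaged over a different number of decodings,
    # which is what produces the dip once correct decodings start finishing.
    return error_sum / np.maximum(count, 1)
```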
Remarkably, the \ac{SNR} seems not to have any impact on the number of
@ -740,9 +740,6 @@ performance.
The decoding failure rate closely resembles the \ac{FER}, suggesting that
the frame errors may largely be attributed to decoding failures.
\subsection{Convergence Properties}
\label{subsec:prox:conv_properties}