Fig 1. Anti slopping pipeline
The next step applies min-p filtering to constrain the adjusted distribution. This step keeps only the coherent candidate tokens that meet the predefined probability threshold. Our anti-slop backtracking algorithm is as follows:
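As a rough illustration of the min-p step alone (not the backtracking algorithm itself), the sketch below uses the common min-p definition: a token survives if its probability is at least `min_p` times the probability of the most likely token. The function name `min_p_filter` and the default `min_p=0.1` are assumptions made for this example.

```python
import math

def min_p_filter(logits, min_p=0.1):
    """Return renormalized probabilities after min-p filtering.

    A token is kept only if p[i] >= min_p * max(p); all others get 0.
    The default min_p value is an illustrative assumption.
    """
    # Softmax with max-subtraction for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The threshold scales with the most likely token's probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize over the surviving candidates.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]
```

Because the threshold is relative to the top token, the filter prunes aggressively when the model is confident and keeps more candidates when the distribution is flat.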
2. Target regularization
This term acts on the chosen and rejected tokens and keeps them close to the predefined reference. It allows a small free region (\(\tau_{\text{target}}\)) within which a token is not penalized.
\[ L_{\text{target}} = \frac{1}{|T|} \sum_{j \in T} \max(|y[j] - y_{\text{ref}}[j]| - \tau_{\text{target}}, 0)^2 \]
\[ \text{where } T = C \cup \{r\} \text{ contains all target tokens} \]
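The target term can be sketched in a few lines of plain Python. This is a minimal illustration, assuming `y` and `y_ref` are lists of current and reference logits indexed by token id, and that `target_ids` lists the tokens in \(T\); the function name and argument layout are choices made for this example.

```python
def target_loss(y, y_ref, target_ids, tau_target):
    """Mean squared excess deviation over the target tokens T = C ∪ {r}.

    Deviations smaller than tau_target fall inside the free region
    and contribute nothing to the loss.
    """
    total = 0.0
    for j in target_ids:
        # Only the part of the deviation beyond tau_target is penalized.
        excess = max(abs(y[j] - y_ref[j]) - tau_target, 0.0)
        total += excess ** 2
    return total / len(target_ids)
```

For example, with `tau_target = 0.5`, a target logit that has drifted 0.2 from the reference costs nothing, while one that has drifted 2.0 contributes \((2.0 - 0.5)^2 = 2.25\) before averaging.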
3. Non-target regularization
This term anchors all other tokens to the reference. It prevents unintended changes in unrelated parts of the vocabulary.
\[ L_{\text{nontarget}} = \frac{1}{|N|} \sum_{j \in N} (y[j] - y_{\text{ref}}[j])^2 \]
\[ \text{where } N \text{ is the set of all non-target tokens} \]
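A matching sketch for the non-target term follows, again assuming `y` and `y_ref` are lists of logits indexed by token id. Here \(N\) is taken to be every vocabulary index not in `target_ids`, which is an assumption about how the set is built; the empty-set guard is added for this example only.

```python
def nontarget_loss(y, y_ref, target_ids):
    """Plain MSE between current and reference logits over non-target tokens."""
    targets = set(target_ids)
    # N is assumed to be every vocabulary index that is not a target token.
    nontarget = [j for j in range(len(y)) if j not in targets]
    if not nontarget:
        return 0.0  # Guard for the degenerate case of an all-target vocabulary.
    return sum((y[j] - y_ref[j]) ** 2 for j in nontarget) / len(nontarget)
```

Unlike the target term, there is no free region here: any drift in a non-target logit is penalized.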
Total loss
\[ L_{\text{FTPO}} = L_{\text{pref}} + \lambda_{\text{target}}L_{\text{target}} + \lambda_{\text{nontarget}}L_{\text{nontarget}} \]
\[ \text{where } \lambda_{\text{target}} \text{ and } \lambda_{\text{nontarget}} \text{ are weighting coefficients} \]
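To make the weighting concrete, the total can be written as a small helper that takes the three component losses as already-computed numbers. The function name and the coefficient values in the usage note are placeholders chosen for illustration, not values taken from the source.

```python
def ftpo_total_loss(l_pref, l_target, l_nontarget,
                    lambda_target, lambda_nontarget):
    """Weighted sum of the preference loss and the two regularizers."""
    return (l_pref
            + lambda_target * l_target
            + lambda_nontarget * l_nontarget)
```

With hypothetical values `l_pref = 0.4`, `l_target = 1.125`, `l_nontarget = 0.125`, `lambda_target = 0.1`, and `lambda_nontarget = 0.5`, the total is \(0.4 + 0.1125 + 0.0625 = 0.575\).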
Three design principles make FTPO effective.
Logit-space operation: FTPO applies MSE loss to the raw "logits" (scores). This allows the model to target and change only the specific "chosen" and "rejected" tokens without disturbing unrelated parts of the vocabulary.
Margin deactivation: FTPO uses a margin \(m\). Once the gap between the good token and the bad token is wide enough relative to this margin, a weight variable \(w_c\) automatically drops to zero. This stops training for that specific pair and so prevents overtraining.
Two-part regularization: FTPO uses a two-part MSE loss that allows target logits to move relatively freely while constraining the remaining vocabulary to the reference. This allows training to high preference accuracy while avoiding destructive logit divergences.
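Margin deactivation can be pictured with a small sketch. The exact form of \(w_c\) is not given in this section, so the linear ramp below is purely an assumption: the weight is 1 while the chosen token is not ahead of the rejected one, and it falls to 0 once the logit gap reaches the margin. The function name `margin_weight` is hypothetical.

```python
def margin_weight(y_chosen, y_rejected, margin):
    """Hypothetical per-pair weight w_c (linear ramp, an assumed form).

    Returns 1.0 when the chosen logit is at or below the rejected logit,
    and 0.0 once the chosen logit leads by at least `margin`.
    """
    gap = y_chosen - y_rejected
    # Clamp the normalized remaining gap into [0, 1].
    return min(max((margin - gap) / margin, 0.0), 1.0)
```

Whatever the true functional form, the property that matters is the second case: a pair that already satisfies the margin contributes no gradient, so its logits stop moving.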
FTPO training
The diagram below shows the end-to-end training-data process for FTPO:
Comparing FTPO and DPO provides us with some interesting results.
Fig 7. FTPO maintains writing quality as training progresses to higher preference accuracies, while DPO degrades sharply after the 40% accuracy mark. This experiment trains gemma-3-12b on a banlist of 1,000 items.
Fig 8. With FTPO, logits stay close to the reference due to (1) the MSE loss terms and (2) the early switch-off feature, which nulls the training signal for chosen tokens that are already winning against the rejected token.