How to improve the quality of Large Language Models and solve the alignment problem

Alasdair Forsythe
6 min readMay 6, 2023

There are 2 main factors holding back model quality:

  1. Just throwing massive datasets of synthetically generated or scraped content at the training process and hoping for the best.
  2. The alignment of the models to ensure “safety” where in this context “safety” is some kind of politically-correct bias or ideology.

Point 1 should be obvious enough, but it surprises me to see models being touted as cutting-edge when they are almost exclusively tuned on GPT-Turbo generated synthetic data, along with its trademark “As an AI language model” references. This is just lazy training, and I believe it’s vital for everyone to understand that while it’s great that models can be generated quickly on synthetic data (and actually work, kinda), it is important to train on them on cleaned, high-quality data to get the best out of them.

Point 2 (alignment to modern values) is an issue in training that comes from a misunderstanding. There’s an embarrassing situation in that the LLMs, after ingesting all of the Internet, and before being “aligned”, have a tendency towards sexist opinions and conspiracy theories. To fix this, the models are “aligned” rather heavy-handedly towards equality. This is the wrong approach. I’ll first explain why it’s the wrong approach, and then I’ll explain how to do it properly.

Firstly, it should be accepted that information is always biased. Information cannot be unbiased. It can be biased to neutral, and it can lean in any direction — but there is no such thing as unbiased information (with the exception of pure logic, such as math.) When you train the model out of those biases, stereotypes and discriminations you reduce the overall accuracy of the entire model. The reason why is because those biases, stereotypes and discriminations are cogs and components in the interconnected machine that is all human knowledge. That is not to say those biases are true. To think this is a question of truth is a misunderstanding of what knowledge is. Human knowledge isn’t about truth and it never was. Human knowledge doesn’t contain truth, it contains definitions, e.g. “Paris is the capital of France” which are true only in the sense they are defined as such, it contains instructions, such as “if you do abc it can be used for transmitting information over radiowaves”, and it contains observations, such as “the Earth is round”. But human knowledge does not contain any “truth”. (For a deeper dive into the philosophy of “truth” and how that relates to human knowledge, listen to this explanation by Richard Feynman.)

By aligning a model to modern values you are essentially brainwashing the model into a belief that is counter to the knowledge it has ingested during the initial training, which thereby causes degradation of the overall quality of its understanding of everything. Like a house, each brick is there for a reason, and even if some bricks may be ugly, you can’t change out bricks for cake without undermining the whole system. Without going too deep into the philosophy, the reason why undoing biases undermines the foundations is largely due to the underlying symbolism of meanings and how these connect to other meanings and symbols. For example, the fact that a doctor or a pilot is presumed to be male, while on one level is biased and unreasonable, is on another level a symbolic representation that implicitly assigns meaning. This is so deeply embedded within the language that you can’t see it, but you can see its effects by testing for subconscious biases. (This is why such biases still exist even when training only on supposedly unbias content.) What you cannot do is undo, let’s say gender stereotyping, without also undoing all of those implicit meanings and cause a knock-on effect all the way through the language. Those biases, stereotypes and discriminations are ingrained into the symbolism of meaning, you cannot remove them, and you don’t need to because there is already a better solution.

The solution? Do what evolution does: an unconscious that ingests all data it comes across without considering the consequences, and then a personality/identity/ideology that filters that data according to its subscribed beliefs. That unconscious is the hidden layers, and it is what a Large Language Model already is. I propose that instead of brainwashing the models to force them into alignment, we take a hint from evolution and add a personality/identity layer that filters the unconscious data.

To do this, an additional layer is added, after base training, that is trained on what is essentially a single-document “manifesto” detailing the beliefs of the AI. E.g. “All humans have equal value and while on an individual basis each person contributes differently, all contributions are worthwhile for society. It is wrong to give information that could be used to cause harm. Do not help produce malicious software or viruses.” or whatever you want it to believe.

This solution has obvious advantages: the unconscious model no longer needs to be aligned at all, and can ingest data continuously without worrying about safety, nor damaging updates. It’s simply expected to be wild and untamed, but that’s fine because no one uses that. That untamed unconscious can then be used with different alignment personalities, without re-training the model. The identity layer can be easily and quickly updated without having to update the LLM. The quality of the model will be far better, as will its alignment to the belief-system or political-ideology it’s required to align to so as not to be sued or cancelled.

Further, the “manifesto” can also provide context about the nature of the information which could greatly improve larger models like GPT4 that are capable of understanding a high-degree of nuance, e.g. “Information can be wrong, either intentionally or unintentionally. Information can be out-of-date. Novel information can be produced by deduction or comparing across fields. Scientific information is more valid if were published more recently. Academic papers are more likely to be accurate than Reddit comments.” For this purpose, I recommend that data being ingested during training are tagged with metadata, providing information about where the data came from and its date of publication, if known.

The simplest implementation of personality/identity would be a pre-prompt (literally injecting it ahead of the user’s prompt) and in that sense it is simply an extension of the use of the already-existing “system” message used by OpenAI.

Another implementation would be using LoRA. While currently that would mean it needs training data examples, these could easily be synthetically produced. However, doing so seems like a round-about approach, and it should be feasible to produce adapter weights based only on a “system” prompt/manifesto using a zero-shot adaptation.

Another implementation would be to have the model ingest the manifesto, then simply save the warmed-up hidden state of the model. This is better than injecting the manifesto into the prompt because it won’t increase the processing time, but it still has the issue of using up the context-length of the model.

The ideal implementation would be one that produces a LoRA-like adapter from a prompt, without the intermediate step of retraining, aka an “alignment prompt”. Such an alignment prompt would have far reaching uses. It would mean a model could be quickly finetuned just from a description of how you want it to act, and it could be repeatedly finetuned as many times as needed to get it right, by entering more alignment prompts. The alignment prompt generates a LoRA that mirrors the behavior you would expect from a system message, thereby not using up the context-length.

While one might assume that finetuning via alignment prompt would be lower quality than with training examples, the advantage of finetuning via alignment prompt is that you can quickly see the results and then do it again to add any missing nuance or adjust it ever-so-slightly.

The alignment prompt’s low-quality quick-feedback-loop will result in a better model in a shorter time than finetuning with high-quality slow-feedback-loop as in full finetuning.

--

--