Is there any work being done in training LLMs on more restricted formal languages? Something like a constraint solver or automated theorem prover, but much lower level. Specifically something that isn't natural language. That's the only path I could see towards reasoning models being truly effective
I know there is work being done with e.g. Lean integration with ChatGPT, but that's not what I mean exactly -- there's still this shaky natural-language-trained-LLM glue in the driver's seat
Like I'm envisioning something that has the creativity to try different things, but then JIT-compiles its chain of thought and avoids bad paths
If I understand your idea correctly, I don't think a "pure" LLM would derive much advantage from this. Sure, you can constrain them to generate something syntactically valid, but there's no way to make them generate something semantically valid 100% of the time. I've seen frontier models muck up their function calling JSON more than once.
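To make the syntax-vs-semantics gap concrete, here's a toy sketch (the schema, sampler, and "valid units" set are all hypothetical, not any real decoding API): a sampler constrained to a fixed JSON shape is syntactically valid by construction, but nothing stops it from choosing an argument value the callee rejects.

```python
import json
import random

# Hypothetical function-calling schema: units the downstream function accepts.
VALID_UNITS = {"celsius", "fahrenheit"}

def constrained_sample(rng):
    # The JSON shape and keys are fixed by the "grammar", so the output
    # always parses -- but the field *value* is left to the model.
    unit = rng.choice(["celsius", "fahrenheit", "kelvin"])
    return json.dumps({"name": "get_weather", "args": {"unit": unit}})

rng = random.Random(0)
calls = [json.loads(constrained_sample(rng)) for _ in range(100)]
# Every call parsed (syntactic validity is guaranteed by construction),
# yet some picked "kelvin" and are semantically invalid.
bad = [c for c in calls if c["args"]["unit"] not in VALID_UNITS]
```

Constrained decoding guarantees the first property; nothing in it guarantees the second.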
As long as you're using something statistical like transformers, you're going to need deterministic bolt-ons like Lean.
I wholeheartedly disagree. Logic is inherently statistical due to the very nature of empirical sampling, which is the only method we have for verification. We will eventually find that it's classical, non-statistical logic which was the (useful) approximation/hack, and that statistical reasoning is a lot more "pure" and robust of an approach.
> My personal insight is that "reasoning" is simply the application of a probabilistic reasoning manifold on an input in order to transform it into constrained output that serves the stability or evolution of a system.
> This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
I've been writing and working on this problem a lot over the last few months and hopefully will have something more formal and actionable to share eventually. Right now I'm at the "okay, this is evident and internally consistent, but what can we actually do with it that other techniques can't already accomplish?" phase that a lot of these metacognitive theories get stuck on.
> Logic is inherently statistical due to the very nature of empirical sampling, which is the only method we have for verification.
What? I'm sorry, but this is ridiculous. You can make plenty of sound logical arguments in an empirical vacuum. This is why we have proof by induction - some things can't be verified by taking samples.
I'm speaking more about how we assess the relevance of a logical system to the real world. Even if a system is internally self-consistent, its utility depends on whether its premises and conclusions align with what we observe empirically. And because empirical observation is inherently statistical due to sampling and measurement limitations, the very act of verifying a logical system's applicability to reality introduces a statistical element. We just typically ignore this element because some of these systems seem to hold up consistently enough that we can take them for granted.
> there's no way to make them generate something semantically valid 100% of the time.
You don't need to generate semantically valid reasoning 100% of the time for such an approach to be useful. You just need to use semantic data to bias them to follow semantically valid paths more often than not (and sometimes consider constraint solving on the spot, like offloading into an SMT solver, or even incorporating one into the model somehow; it would be nice to have AI models that can combine the strengths of both GPUs and CPUs). More usefully still, verify that the reasoning is valid at the end of the chain of thought, and if it is not, bail out and attempt something else.
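The bail-out-and-retry loop above can be sketched in a few lines. This is a stand-in, not a real system: `propose` substitutes for sampling a candidate from an LLM, and `verify` substitutes for the deterministic end-of-trace check you'd offload to an SMT solver (e.g. Z3) or a proof assistant.

```python
import random

def propose(rng):
    # Stand-in for sampling a candidate "reasoning trace": here, a
    # claimed nontrivial factorization of 91.
    a = rng.randint(2, 90)
    return (a, 91 // a)

def verify(candidate):
    # Stand-in for the deterministic checker run after generation.
    a, b = candidate
    return a * b == 91 and a > 1 and b > 1

def solve(rng, max_tries=1000):
    for _ in range(max_tries):
        candidate = propose(rng)
        if verify(candidate):  # keep only reasoning that checks out
            return candidate
    return None  # bail out entirely if nothing survives verification

result = solve(random.Random(0))
```

Most proposals are wrong and get discarded; the only results that escape the loop are ones the verifier accepts, which is the whole point.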
If you see AI as solving an optimization problem (given a question, give a good answer), it's kind of evident that you need to probe the space of ideas in an exploratory fashion, sometimes making unfounded leaps (of the "it was revealed to me in a dream" sort), and in this sense it could even be useful that AI can sometimes hallucinate bullshit. But afterwards they need to come up with a good justification for the end result, and if they can't find one they are forced to discard it (even if it's true). Just like humans often come up with ideas in an irrational, subconscious way, and then proceed to rationalize them. One way to implement this kind of thing is to have the LLM generate code for a proof assistant like Coq or Lean, and then run the code at the end - if the prover rejects the code, the reasoning can't possibly be right, and the AI needs to go back to the drawing board.
(Now, if the prover accepts the code, the answer may still be wrong if the premises were encoded incorrectly - but it would still be a net improvement, especially if people can review the Coq code to spot mistakes.)
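The "prover as gatekeeper" idea is easy to see in a minimal Lean 4 example (a toy, just to show the shape of the check):

```lean
-- The kernel accepts this: the proof term checks against the statement.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Whereas `theorem bad : 2 + 2 = 5 := rfl` would be rejected outright,
-- so an LLM emitting it would be forced back to the drawing board.
```

Note the caveat from the comment above: acceptance is always relative to how the statement was encoded. If the premises are formalized wrongly, the kernel will happily certify a proof of the wrong thing.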
I think that would be a fundamental mismatch. LLMs are statistical and lossy and messy, which is what (paradoxically) permits them to get surprisingly-decent results out of messy problems that draw upon an enormous number and variety of messy examples.
But for a rigorously structured language with formal, fixed meaning... now the LLM has no advantage anymore, only serious drawbacks and limitations. Save yourself millions of dollars and just write a normal parser, expression evaluator, SAT solver, etc.
You'll get answers faster, using fewer resources, with fewer fundamentally unfixable bugs, and it will actually be able to do math.
How would that be different from something like ChatGPT executing Lean? That's exactly what humans do, we have messy reasoning that we then write down in formal logic and compile to see if it holds.
In my mind, the pure reinforcement learning approach of DeepSeek is the most practical way to do this. Essentially it needs to continually refine and find more sound(?) subspaces of the latent (embedding) space. Now this could be the subspace which is just Python code (or some other human-invented subspace), but I don't think that would be optimal for the overall architecture.
The reason it seems like the most reasonable path is that when you create restrictions like this you hamper search viability (and in a high-dimensional space that's a massive loss, because you can arrive at a result from many directions). It's like regular genetic programming vs. typed genetic programming: when you discard all your useful results, you can't go anywhere near as fast. There will be a threshold where constructivist, generative schemes (e.g. reasoning with automata and all kinds of fun we've neglected) will be the way forward, but I don't think we've hit that point yet. It seems to me that such a point does exist, because if you have fast heuristics for when types unify, you no longer hamper search speed but gain many benefits in soundness.
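The regular-vs-typed GP trade-off can be sketched concretely (a hypothetical toy, with made-up operators and values): an untyped search generates freely and throws away ill-typed candidates after the fact, whereas a cheap type check prunes them before evaluation without forbidding any well-typed result.

```python
import random

# Toy search space: each operator only accepts arguments of one type.
OPS = [("+", int), ("concat", str)]
VALUES = [1, 2, "a", "b"]

def untyped_candidates(rng, n):
    # Generate blindly, like untyped GP: many candidates will be
    # ill-typed (e.g. applying "+" to a string).
    return [(rng.choice(OPS), rng.choice(VALUES), rng.choice(VALUES))
            for _ in range(n)]

def well_typed(cand):
    # Cheap heuristic filter, standing in for fast type-unification checks.
    (op, ty), x, y = cand
    return isinstance(x, ty) and isinstance(y, ty)

rng = random.Random(0)
pool = untyped_candidates(rng, 1000)
kept = [c for c in pool if well_typed(c)]
# A large fraction of the untyped pool is wasted effort; a typed
# generator would never have produced the rejects in the first place.
```

The question in the comment above is whether the pruning (soundness) is worth the directions of approach it cuts off; the sketch only shows the mechanics of the trade, not its answer.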
One of the greatest human achievements of all time is probably this latent embedding space -- one that we can actually interface with. It's a new lingua franca.
DeepSeek's approach with R1 wasn't pure RL - they used RL only to develop R1-Zero from their V3 base model, and then went through two iterations of using the current model to generate synthetic reasoning data, SFT on that, then RL fine-tuning, and repeat.
fwiw, most people don't really grok the power of latent space wrt language models. Like, you say it, I believe it, but most people don't really grasp it.