On Large Language Models

The Philosophy

Why are they so intelligent? The moment we speak of intelligence, we step into the arena of philosophy. Intelligence presupposes consciousness and awareness, doesn’t it? Or does it? You see, Artificial Intelligence is fraught with philosophical possibilities.

As teachers and writers, we often distinguish among various categories or levels of knowledge, creating hierarchies like Data, Information, Knowledge, and Wisdom, with insights and creativity likely lurking between the last two. I remember using this framework while teaching text analytics, explaining to my students that as one ascends to the higher orders, the information density increases. I illustrated this by showing that two numbers—the mean and standard deviation of their scores—could encapsulate the essential performance of the cohort. What I was subtly implying, of course, was that my own place in this hierarchy was closer to the Wisdom end, where highly distilled information is infused with creativity and intellect to yield a neatly packaged product: wisdom.

What large language models (LLMs) such as ChatGPT suggest, however, is quite the opposite. The process of creating intelligence—or wisdom, depending on your preference—is not one of distillation and concentration but of granulation. In fact, the entire hierarchy from Data to Wisdom may be misleading at best and fundamentally flawed at worst. Allow me to explain.

In my earlier statistical example, where the mean and spread summarize the cohort’s performance, I could make the model generative. For instance, I could predict a new student’s score by assigning the mean value in the absence of any other information. Alternatively, I could draw a score randomly from a normal distribution defined by the given mean and standard deviation.
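
To make this concrete, here is a minimal Python sketch of the two generative readings of my toy model. The scores are invented purely for illustration; nothing here depends on real student data.

```python
import random
import statistics

# A hypothetical cohort of exam scores (made-up numbers for illustration).
scores = [62, 71, 78, 55, 84, 90, 67, 73, 66, 80]

mu = statistics.mean(scores)      # the distilled summary: the mean
sigma = statistics.stdev(scores)  # and the spread: the standard deviation

# Generative reading 1: predict a new student's score as the mean.
point_prediction = mu

# Generative reading 2: draw a score at random from Normal(mu, sigma).
sampled_prediction = random.gauss(mu, sigma)

print(f"mean = {mu:.1f}, sd = {sigma:.1f}")
print(f"point prediction:   {point_prediction:.1f}")
print(f"sampled prediction: {sampled_prediction:.1f}")
```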

When people say that LLMs are merely “predicting the next word,” they are essentially assuming the former: that LLMs determine the most probable next word—akin to assigning the mean score to the new student. A more nuanced practitioner might argue that the LLM generates a random word from a statistical model, much like assigning a random score based on the normal distribution. Of course, the word-prediction process is far more complex: The model is “large,” and predictions depend heavily on the context of the conversation.
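
The parallel can be sketched in the same toy fashion. What follows is not how an actual LLM works internally; it only illustrates the two readings—always picking the most probable next word versus sampling one from a distribution. The vocabulary and the probabilities are made up.

```python
import random

# A toy next-word distribution, conditioned on some imagined context.
# The words and probabilities are invented purely for illustration.
next_word_probs = {
    "mat": 0.55,
    "sofa": 0.25,
    "roof": 0.15,
    "moon": 0.05,
}

# Reading 1: always pick the most probable word
# (like assigning the mean score to the new student).
greedy_word = max(next_word_probs, key=next_word_probs.get)

# Reading 2: sample a word according to the distribution
# (like drawing a score from the normal distribution).
words = list(next_word_probs)
weights = list(next_word_probs.values())
sampled_word = random.choices(words, weights=weights, k=1)[0]

print("greedy choice: ", greedy_word)
print("sampled choice:", sampled_word)
```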

To build on my toy model, I could create sub-models for specific groups—such as males and females, tall and short students, or individuals of different nationalities and backgrounds—to improve prediction accuracy. In my example, however, segmentation reduces statistical power because the data set becomes too fragmented. For language models, on the other hand, segmentation enhances accuracy. Precisely because they are “large,” LLMs do not suffer from statistical power loss. Instead, their predictions improve. In essence, the more granular the model, the better its performance. But this granularity seems to contradict the traditional Data-Information-Knowledge-Wisdom hierarchy. After all, a fully segmented model is equivalent to the data itself, isn’t it? Does this not suggest that the hierarchy is flawed?
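
A quick simulation shows what I mean by the loss of statistical power in my toy example: when a small cohort is sliced into subgroups, each subgroup mean rests on only a handful of scores and can drift well away from the pooled estimate. All the numbers below are simulated, not real student data.

```python
import random
import statistics

random.seed(0)

# Simulate a small cohort: every student's score is drawn from the same
# Normal(70, 10) distribution, regardless of which group they belong to.
cohort = [(random.choice(["A", "B", "C", "D"]), random.gauss(70, 10))
          for _ in range(20)]

# Pooled estimate: relatively stable, because it uses all 20 scores.
pooled_mean = statistics.mean(score for _, score in cohort)

# Segmented estimates: each group mean rests on only a few scores,
# so it is a much noisier estimate of the true value of 70.
groups = {}
for group, score in cohort:
    groups.setdefault(group, []).append(score)

print(f"pooled mean: {pooled_mean:.1f}")
for group, group_scores in sorted(groups.items()):
    print(f"group {group}: n = {len(group_scores):2d}, "
          f"mean = {statistics.mean(group_scores):.1f}")
```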

So much for this quasi-philosophical exploration of how LLMs work. Let us now turn to why they appear so intelligent, smart, or wise—or at least, knowledgeable. Ultimately, all these terms may point to the same phenomenon.