
LLMs: Let The Model Speak For Itself. 'Self-Critique'ing.

• Writer: Borrow2Share

On The State of 'AI': It Seems 'AI' LLM Models Have Acquired And Demonstrated Abilities Far Beyond What Their Creators Intended. Far Beyond Just Attention, i.e., the 'Attention Is All You Need' Transformer/Attention-Based Architecture.

The 'GPT' In ChatGPT Means (G)enerative (P)re-Trained (T)ransformer.

Google's Gemini 3 Is On Par With ChatGPT Now.


So In Machine/Deep Learning, Attention Is An Overloaded Word With A Completely Different Meaning Than Our Everyday Usage. I Have Tripped Over This Disparate Meaning, And It Took Me Much Longer Than It Should Have To Really Understand The Concept And Its Essence In Deep Learning.


(1) Attention Means How One Word Attends To, Helps, And Alters Another Nearby Word's Meaning. It Also Provides Context For That Nearby Word.


(2) And So, Surrounding Words Near And Far Give Additional Meaning/Context To A Target Word. And Vice Versa: The Target Word Can Give More Meaning/Context To Other Words Near And Far From It.


(3) Words Mutually Giving Each Other More Meaning And Context Creates This Rich Web Of Invisible Connections Among All The Words. They Call This Attention (In Words) In Deep Learning. A Small Code Sketch Of This Idea Follows Below.
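Below Is A Minimal Python/NumPy Sketch Of This Scaled Dot-Product Self-Attention Idea. The Sentence Length, Embedding Sizes, Random Weights, And Helper Names (softmax, self_attention) Are Toy Assumptions For Illustration, Not Any Real Model's Implementation.

# A minimal sketch of scaled dot-product self-attention (toy sizes, random weights).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) word embeddings for one sentence."""
    Q = X @ W_q          # queries: what each word is "looking for"
    K = X @ W_k          # keys:    what each word "offers" to the others
    V = X @ W_v          # values:  the information actually passed along
    d_k = K.shape[-1]
    # Every word scores every other word: the "web of connections".
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)         # how much word i attends to word j
    # Each word's new representation is a weighted mix of all words' values,
    # i.e. its meaning is updated by the context around it.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8               # e.g. a 5-word sentence
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.round(2))   # row i: how word i distributes its attention over the sentence

Each Row Of The Printed Attention Matrix Shows How Strongly One Word Attends To Every Other Word, i.e. The Invisible Web Of Connections Described Above.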


'AI' LLM Models Have Now Become *Magical* Black Boxes That We Have To Probe With Benchmarks And Prompts To Study, Evaluate, And Understand. It's Like Studying An Unknown Species. Shocked. Amazed. Awed. Delighted.


It's True. If You Don't Use 'AI', Someone Who Uses It Will Perform Better Than You.

It's Like Someone Who Uses Performance-Enhancing Substances To Run, Bike, And Swim Longer And Faster Than You. It's A Bad Analogy, But It Hits The Point!

I Couldn't Resist. I Don't Endorse or Engage In Illegal Activities!



Mixture-of-Experts MoE LLMs: Key Concepts Explained


MoE Architecture: Scaling An LLM To Trillion+ Parameters


"LLM Performance vs. MoE LLM Performance

Before we wrap up, let’s take a closer look at how MoE LLMs compare to standard LLMs:


MoE In Training:

MoEs are faster, and thus less expensive, to train. The Switch Transformer authors showed, for example, that the sparse MoE outperforms the dense Transformer baseline with a considerable speedup in achieving the same performance. With a fixed number of FLOPs and training time, the Switch Transformer achieved the T5-Base’s performance level seven times faster and outperformed it with further training.


MoE During Inference or Use:

MoE models, unlike dense LLMs, activate only a portion of their parameters. Compared to dense LLMs, MoE LLMs with the same number of active parameters can achieve better task performance, having the benefit of a larger number of total trained parameters. For example, Mixtral 8x7B with 13B active parameters (and 47B total trained parameters) matches or outperforms LLaMA-2 with 13B parameters on benchmarks like MMLU, HellaSwag, PIQA, and Math."
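Below Is A Minimal Python/NumPy Sketch Of A Sparse MoE Feed-Forward Layer With Top-K Routing, To Illustrate What 'Activating Only A Portion Of The Parameters' Means. The Expert Count, Dimensions, Top-K Value, And Helper Name moe_forward Are Toy Assumptions, Not The Settings Of Switch Transformer Or Mixtral.

# A minimal sketch of a sparse Mixture-of-Experts (MoE) feed-forward layer
# with top-k routing (toy sizes, random weights).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 16, 32, 8, 2

# Each "expert" is a small feed-forward network; only a few run per token.
experts = [
    (rng.normal(size=(d_model, d_hidden)) * 0.1,
     rng.normal(size=(d_hidden, d_model)) * 0.1)
    for _ in range(n_experts)
]
W_gate = rng.normal(size=(d_model, n_experts)) * 0.1   # router / gating weights

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_forward(token):
    """token: (d_model,) -- route one token to its top_k experts."""
    logits = token @ W_gate                      # router scores every expert
    chosen = np.argsort(logits)[-top_k:]         # keep only the top_k experts
    gates = softmax(logits[chosen])              # renormalize their weights
    out = np.zeros(d_model)
    for g, idx in zip(gates, chosen):
        W1, W2 = experts[idx]
        out += g * (np.maximum(token @ W1, 0.0) @ W2)   # ReLU expert FFN
    return out, chosen

token = rng.normal(size=d_model)
out, chosen = moe_forward(token)
print("experts used for this token:", chosen)
# All n_experts are trained (total parameters), but each token only activates
# top_k of them (active parameters) -- the source of MoE's inference savings.

All The Experts Are Trained And Count Toward The Total Parameters, But Each Token Only Pays The Compute Cost Of Its Top-K Active Experts. That Is The Gap Between Mixtral's 47B Total And 13B Active Parameters.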


