In this course, AI is defined not by whether a machine "thinks" — but by whether it produces intelligent behavior through a chain of engineering decisions.
Perception: the system acquires raw input from its environment — pixels from a camera, text from a customer message, sensor readings, clicks.
Cat feeder: a camera captures a face at the bowl — raw pixel values, nothing more.
Return chatbot: a customer types "my drill is busted" — a string of characters the system cannot yet act on.
Representation: raw input is transformed into a structured form that computation can act on. The unifying intuition: a representation is a compression that preserves what matters for the task and discards what does not. The task defines what "matters."
Key insight: the goal defines the representation. A "cat detector" (cat: yes/no) cannot distinguish Maurice from Garfield. A different goal — whose cat is this? — requires a completely different representation, even with identical input pixels.
Turning raw input (pixels, words, audio) into something a model can use is called feature engineering. There are two broad approaches, and the tradeoff between them comes up constantly in practice.
The first approach is hand-crafted features: a human decides what to measure. For a photo of a cat at the feeder, you might extract:
```
brightness: 0.72
contrast: 0.88
face_width_px: 142
ear_pointiness: 0.91
fur_color_rgb: [180,140,90]
```
Each number has a meaning you can explain and debug. The limit: you can only capture what you thought to measure. Variation you did not anticipate — a new angle, different lighting — may break the system.
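To make the first approach concrete, here is a minimal sketch of hand-crafted extraction. The function name, the feature choices, and the random stand-in image are illustrative, not taken from the course system:

```python
import numpy as np

def extract_features(image: np.ndarray) -> dict:
    """Hand-crafted features from a grayscale image with values in 0..1.

    Every number here is a human decision about what to measure; the
    model never sees raw pixels, only these summaries.
    """
    return {
        "brightness": round(float(image.mean()), 2),  # overall light level
        "contrast": round(float(image.std()), 2),     # spread of pixel values
        "face_width_px": image.shape[1],              # placeholder: a real system
                                                      # would run a face detector
    }

photo = np.random.default_rng(0).random((64, 64))    # stand-in for a camera frame
print(extract_features(photo))
```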
The second approach is learned representations: instead of a human choosing what to measure, the features are learned from data. For text: words are mapped to vectors where similar meanings land nearby. "Broken," "damaged," and "defective" all cluster together — so the system recognizes that "my drill is busted" and "I received a faulty drill" express the same intent, even though they share no words.
For video: each frame is encoded as an image representation; the audio track is converted to a spectrogram (a visual map of sound frequencies over time) and treated similarly. Representations across frames are combined to capture motion and sequence.
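A toy sketch of the text case shows why near-synonyms land nearby. The 3-dimensional vectors below are invented purely to illustrate the geometry; real systems learn vectors with hundreds of dimensions:

```python
import numpy as np

# Invented toy "embeddings" -- not from any real model.
emb = {
    "busted":  np.array([0.9, 0.2, 0.1]),
    "faulty":  np.array([0.8, 0.3, 0.1]),
    "shipped": np.array([0.1, 0.9, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["busted"], emb["faulty"]))   # high: near-synonyms sit close
print(cosine(emb["busted"], emb["shipped"]))  # low: unrelated meanings sit apart
```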
Model: a simplified, explicit structure that supports prediction or decision. Every AI system contains a model, and every model is a selective simplification: what is left out matters as much as what is included.
Cat feeder: "Is this a cat?" — two decision points, two outcomes. Simple flowchart.
Return chatbot: a sequence — is this a return request? is the order valid? does the photo match? — leading to an action.
Algorithm: a sequence of steps that operates over the representation, respects the constraints, and produces an output. For the cat feeder: compare input embedding to registered embeddings → find nearest match → check confidence → check feeding schedule → output a decision. Each step is computable because the representation made it computable.
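A minimal sketch of that pipeline, assuming face embeddings arrive as vectors. The thresholds, names, and data structures are invented for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decide_feeding(input_emb, registered, hours_since_fed,
                   min_confidence=0.8, min_gap_hours=4):
    """Steps mirror the pipeline: match -> confidence -> schedule -> decision.
    Thresholds and field names are illustrative, not tuned values."""
    # 1-2. Compare the input embedding to each registered embedding; keep the nearest.
    name, score = max(((n, cosine(input_emb, e)) for n, e in registered.items()),
                      key=lambda pair: pair[1])
    # 3. Check confidence: an unknown cat should not be fed.
    if score < min_confidence:
        return "no_feed: unknown cat"
    # 4. Check the feeding schedule (the no-overfeeding constraint).
    if hours_since_fed.get(name, float("inf")) < min_gap_hours:
        return f"no_feed: {name} was fed recently"
    # 5. Output a decision.
    return f"feed: {name}"

registered = {"Maurice": np.array([0.9, 0.1]), "Garfield": np.array([0.1, 0.9])}
print(decide_feeding(np.array([0.85, 0.15]), registered, {"Maurice": 6.0}))  # feed: Maurice
```

Note that every step leans on the representation: the match is only computable because faces were already compressed into comparable vectors.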
In learning-based systems, the weights inside the algorithm are learned from data — not written by a human. That is the key difference from a hand-coded rule system.
Constraints: rules that limit which outputs are allowable. A return chatbot cannot issue a return label for an order outside the 90-day window — but only if that constraint is encoded and the representation carries the required fields.
Key insight: constraints drive representation design. If you need to enforce a rule, your data must carry the fields that make the rule checkable.
Cat feeder: Maurice is recognized — but last_fed: 1h ago means eligible: no. Identity alone is not enough. The no-overfeeding constraint required adding new fields to the representation.
Return chatbot: return window = 90 days; product must match the order. Both constraints require specific fields in the representation to be enforceable at all.
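A sketch of how those two constraints might be encoded. The field names are invented; the point is that the checks are only possible because the representation carries the order date and product fields:

```python
from datetime import date

RETURN_WINDOW_DAYS = 90  # hard policy constraint

def return_allowed(order: dict, today: date) -> bool:
    """Checkable only because the representation carries order_date and
    the product fields. All field names here are illustrative."""
    if (today - order["order_date"]).days > RETURN_WINDOW_DAYS:
        return False                                          # outside the window
    return order["product_id"] == order["claimed_product_id"]  # must match the order

order = {"order_date": date(2024, 1, 10),
         "product_id": "DRILL-200", "claimed_product_id": "DRILL-200"}
print(return_allowed(order, date(2024, 2, 1)))   # True: 22 days old
print(return_allowed(order, date(2024, 6, 1)))   # False: 143 days old
```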
Behavior: what the system does over time — the observable output of the entire pipeline. Maurice gets fed. Garfield is blocked. A return label is issued or denied. Behavior is what users experience; everything else is invisible infrastructure.
- Video · 20 min · 3Blue1Brown — But What Is a Neural Network? — best visual intuition for how representations and models connect
- Article · HBR — AI for the Real World — frames AI as a tool for specific tasks, not general intelligence
It helps to distinguish between two things that often get conflated: a large language model (LLM) and a deployed AI system like ChatGPT.
An LLM is a model that predicts the next token — the next word, or piece of a word — based on patterns in billions of text examples. It does not reason, retrieve, or act. It generates plausible continuations of text. The output feels like understanding because human language is the training data. But ask it to do precise arithmetic or recall events from after its training cutoff, and it fails — because those tasks require something other than pattern completion.
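A deliberately crude sketch of next-token prediction helps here. This is a bigram counter over a toy corpus, nothing like a real LLM's internals, but the task has the same shape: score candidate next tokens, emit a likely one:

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration. Real LLMs train neural networks
# on billions of examples rather than counting word pairs.
corpus = "my drill is busted . my drill is broken . my order is late .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1          # count which token follows which

def next_token(word: str) -> str:
    return follows[word].most_common(1)[0][0]

print(next_token("drill"))  # 'is': a plausible continuation, not understanding
```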
ChatGPT is a full AI system built on top of an LLM. It adds tool use (web search, code execution, calculators), memory, safety filters, and other components. In the course framework: the LLM is the model/algorithm layer, and ChatGPT wraps it in perception (your message), constraints (safety and policy layers), and behavior (response plus actions taken by tools). Saying "ChatGPT thinks" conflates the underlying model with the full system — and misses where the real engineering decisions live.
Yes, deliberately. Whether machines are "really" intelligent is a philosophical question with no operational answer. The engineering definition is useful precisely because it is narrow: it tells you what you need to build, what can go wrong, and how to evaluate whether a system is working. For business purposes, behavior is what matters — if the system produces the right outputs under the right conditions, it is doing its job, regardless of inner experience.
The line is the perception-action loop. A regression model applied to a spreadsheet produces an output that a human then acts on — the model is not in the loop. A recommendation engine perceives a user's behavior in real time, represents it, models preferences, and directly changes what that user sees — closing the loop between perception and action. The distinction is not about the algorithm; it is about whether the system itself is embedded in the environment it affects.
- Short read · IEEE Spectrum — The Turbulent Past and Uncertain Future of AI — survey of definitions from practitioners
- Classic · Turing, A. (1950). Computing Machinery and Intelligence. Mind, 59(236), 433–460. — the original "can machines think?" paper; surprisingly readable
| Dimension | Rule-Based | Learning-Based |
|---|---|---|
| How it works | Human encodes knowledge as explicit IF-THEN rules | System induces patterns from labeled examples |
| Works well when | Domain is small, stable, and fully enumerable | Domain is large, variable, or hard to articulate |
| Fails when | World changes faster than rules can be updated; input varies infinitely | Training data is biased, scarce, or mislabeled |
| Transparency | Fully auditable — every rule is readable | Often opaque — learned weights are not interpretable |
| Marketing example | Phone tree: Press 1 for returns, Press 2 for status | Intent classifier: maps "my drill is busted" → start_return |
This is a genuinely good question, and the honest answer is: it depends on how you define AI — and the definition keeps moving. By the engineering framework from class (perception → representation → model → behavior), a phone tree qualifies: it perceives input (button presses), applies a model (the menu script), and produces behavior (routing you to the right department). It is a very simple AI system.
But most people instinctively feel it is not AI. This intuition has a name: the AI effect. Once a technology becomes routine and understood, we stop calling it AI and start calling it "just software." Expert systems that could diagnose infections better than junior doctors were called AI in 1985. Today we would call them decision trees. The same technology, reclassified. Wikipedia — The AI Effect →
What makes the phone tree feel like "not AI" is that it has no learning, no generalization, and no ability to handle inputs outside its explicit menu. These are real limits — and they are exactly why learning-based systems replaced rule-based ones for open-ended tasks. But the framework applies to both. Recognizing this helps you evaluate any system someone calls "AI," not just the impressive ones.
This is one of those cases where the popular diagram — a set of nested circles with AI on the outside and ML inside — is convenient but not quite right. Two points worth keeping separate:
AI does not require machine learning. Expert systems — rule-based programs that encode human knowledge as IF-THEN logic — were the dominant form of AI from the 1970s through the late 1980s. They had no learning component whatsoever. A phone tree, a chess engine with hand-coded evaluation functions, or a medical diagnosis system built from clinical rules can all produce intelligent behavior without learning anything from data. Saying ML is necessary for AI would make all of those not AI — which is historically and practically wrong.
Not all machine learning is AI. Using a neural network to predict next quarter's sales from a spreadsheet — with a human looking at the output and deciding what to do — is machine learning. But by the framework from this course, it is not necessarily AI: there is no perception loop, no environment the system acts on, no behavior it produces autonomously. It is a sophisticated statistical model. Useful, but not the same thing. The moment that model is embedded in a system that perceives customer behavior and automatically adjusts pricing — now it is part of an AI system.
The cleaner framing: AI and ML overlap, but neither contains the other. Some AI uses ML. Some ML is part of an AI system. Both can exist without the other. What makes something AI is not the algorithm — it is whether the system closes the loop between perception and action in the world.
Both approaches coexist in practice. Many deployed systems layer them: a rule-based component enforces hard policy constraints while a learning-based component handles open-ended input.
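A sketch of that layering, with all names invented and a keyword stub standing in for the learned classifier:

```python
def classify_intent(message: str) -> str:
    # Keyword stand-in for a trained classifier; real systems use embeddings.
    keywords = ("busted", "broken", "faulty")
    return "start_return" if any(w in message for w in keywords) else "other"

def handle_message(message: str, order: dict) -> str:
    """Layered design: a learned component interprets open-ended input,
    a rule-based component enforces hard policy. Names are illustrative."""
    intent = classify_intent(message)          # learning-based: flexible
    if intent == "start_return":
        if order["age_days"] > 90:             # rule-based: non-negotiable
            return "deny: outside 90-day return window"
        return "issue_return_label"
    return "route_to_agent"

print(handle_message("my drill is busted", {"age_days": 30}))   # issue_return_label
print(handle_message("my drill is busted", {"age_days": 120}))  # denied by the rule
```

The design choice is the point: the learned layer can be retrained freely, while the policy layer stays auditable and cannot be overridden by a model's mistake.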
- Article · IBM — What Are Expert Systems?
- Short read · HBR — The Simple Economics of Machine Intelligence (Agrawal, Gans, Goldfarb)
Machine learning estimates a function f from data: given inputs, produce outputs. The goal is not to memorize training examples but to generalize — to produce correct outputs on inputs the model has never seen.
Garfield wearing a cardboard mask has never appeared in training data. The raw pixels look nothing like any training photo. But the face embedding is still close enough to "Garfield" that the model correctly identifies and blocks him. That is generalization.
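A minimal sketch of that idea at the embedding level, with invented vectors:

```python
import numpy as np

# Embeddings registered at training time (numbers invented for illustration).
registered = {"Maurice": np.array([0.9, 0.1]), "Garfield": np.array([0.1, 0.9])}

# A never-seen input: Garfield in a cardboard mask. The raw pixels are new,
# but a good encoder still places him near his usual region of embedding space.
masked_garfield = np.array([0.2, 0.8])

nearest = min(registered, key=lambda n: np.linalg.norm(registered[n] - masked_garfield))
print(nearest)  # 'Garfield': correct output on an input absent from training
```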
Every supervised learning system requires labeled training data. Someone looked at each example and assigned the correct output. At scale, this is done by paid annotation workers via platforms like Amazon Mechanical Turk or Scale AI. Labels are not free, not perfect, and not neutral.
Google Photos, 2015. The image classifier labeled photos of Black users as "gorillas." The root cause was not a broken algorithm — it was training data that did not include sufficient diversity of faces, combined with annotators who lacked the guidance and context to label fairly. Google removed the gorilla category entirely rather than fix it. The Guardian, 2015 →
Hiring algorithms. Multiple companies trained resume-screening systems on historical hiring data. Because most historical hires were male, the model learned to penalize resumes with signals associated with women — including attending women's colleges. The annotators did not intend this; they labeled accurately. The bias was in the data itself. Reuters — Amazon's scrapped AI recruiting tool →
Medical imaging. A dataset intended to classify chest X-rays was labeled by radiologists in a single country. The model performed well there and poorly in other regions — not because the algorithm failed, but because the annotators' definitions of "abnormal" reflected one clinical context. Nature Medicine — Underdiagnosis bias in AI systems →
Cost as a constraint. For a catalog of 10,000 products with 50 photos each, that is 500,000 individual labeling judgments. At a few cents per label, this costs tens of thousands of dollars — before quality checks. Companies routinely cut annotation budgets, which means more rushed decisions, more edge cases guessed rather than escalated, and more noise in the training data.
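The arithmetic behind that estimate, with illustrative per-label prices:

```python
products, photos_per_product = 10_000, 50
labels = products * photos_per_product          # 500,000 judgments
for cents in (2, 4, 8):                         # illustrative per-label prices
    print(f"{labels:,} labels at {cents}¢ each = ${labels * cents / 100:,.0f}")
# 500,000 labels at 2¢ each = $10,000 ... at 8¢ each = $40,000, before quality checks
```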
Training is expensive — requires labeled data, compute, and engineering infrastructure. It happens offline, periodically. Prediction is cheap — the trained model is just a function evaluated on new input in milliseconds. Most deployed systems are trained periodically and then frozen.
- Representative: covers the range of inputs the model will actually see in deployment — not just the clean, easy cases.
- Accurately labeled: edge cases are where label quality degrades most.
- Large enough: more variation in the world requires more examples.
- Recent enough: product lines change; old training data may not reflect current inputs.
- Video · 20 min · 3Blue1Brown — What Is a Neural Network?
- Article · MIT Tech Review — The Humans Behind AI's Data
- Interactive · Google ML Crash Course — Overfitting
| Type | Feedback during training | What it learns | Marketing use |
|---|---|---|---|
| Supervised | Labeled examples (input + correct output) | A function mapping inputs to labels or values | Churn prediction, click probability, intent classification, product image matching |
| Unsupervised | None — just the data itself | Structure and groupings in the data | Customer segmentation, topic discovery in support tickets, anomaly detection |
| Reinforcement | Rewards/penalties from the environment | A policy: what action to take in each state | Real-time ad bidding, recommendation engines, chatbot escalation policy |
The most common type of ML in marketing applications. You provide labeled examples — (input, correct output) pairs — and the system learns a function that maps new inputs to outputs it has never seen. The word "supervised" refers to the fact that human judgment is baked into every label.
There are two main tasks: classification (predicting a category — churn/no churn, match/no match, which intent) and regression (predicting a continuous value — predicted lifetime value, optimal bid price, expected revenue). The output type differs; the underlying logic is the same: learn a function from labeled pairs.
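A minimal sketch of both tasks using scikit-learn. The feature names and numbers are invented toy data:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy features per customer: [support_tickets, months_subscribed]. Invented data.
X = [[1, 24], [9, 2], [2, 30], [8, 1], [3, 18], [7, 3]]

churned = [0, 1, 0, 1, 0, 1]                      # classification: a category
clf = LogisticRegression().fit(X, churned)
print(clf.predict([[8, 2]]))                      # -> [1]: predicted to churn

ltv = [900.0, 60.0, 1200.0, 40.0, 700.0, 90.0]    # regression: a continuous value
reg = LinearRegression().fit(X, ltv)
print(reg.predict([[8, 2]]))                      # -> a dollar estimate
```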
Spam filters. Gmail's spam classifier is trained on billions of labeled emails — spam or not spam — marked by users over years. When you mark something as spam, you are contributing a training label. The model learns which combinations of sender, subject, content, and metadata predict spam. Google's approach to email classification →
Netflix recommendations. Netflix's recommendation system uses supervised classification to rank content for each user. Labeled training data — which titles a user watched, completed, or skipped — feeds classifiers including logistic regression, support vector machines, neural networks, and gradient boosted decision trees. Each model learns to predict which content a user will engage with. The label is implicit behavior, not a human rating. Netflix Tech Blog — Beyond the 5 Stars (Part 2) →
From class — Maurice and the return chatbot. Both are classification problems. The feeder maps face embeddings to {Maurice, not Maurice}. The chatbot maps message embeddings to {start_return, check_status, damaged_item, …}. Same structure, different representations.
→ Go deeper: 3Blue1Brown — Neural Networks (Video · 20 min) — shows visually how a network learns to classify handwritten digits, which is structurally identical to learning to classify Maurice vs. not Maurice. Start here if the idea of "learning from examples" still feels abstract.
No labels. No correct answers. The system finds structure in data on its own — groupings, patterns, anomalies — without being told what to look for. The most common task is clustering: partitioning examples into groups that are similar within and different across. Other tasks include dimensionality reduction (compressing a high-dimensional representation into something visualizable) and anomaly detection (finding examples that do not fit any pattern).
The key shift: in supervised learning, a human defines the categories in advance. In unsupervised learning, the algorithm proposes the categories and a human decides whether they are meaningful.
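A minimal clustering sketch with scikit-learn's KMeans, on invented 2-D "embeddings" standing in for real ticket vectors:

```python
from sklearn.cluster import KMeans

# Toy 2-D ticket embeddings (invented). Note: no labels anywhere.
tickets = [[0.10, 0.90], [0.15, 0.85], [0.20, 0.80],   # charger/battery texts
           [0.90, 0.10], [0.85, 0.15], [0.80, 0.20]]   # shipping-delay texts

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tickets)
print(km.labels_)  # e.g. [1 1 1 0 0 0]: the algorithm proposes the groups;
                   # a human still decides whether they are meaningful
```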
Clustering at scale. Spotify uses clustering to group songs, artists, and users by similarity — without predefined categories. Each user's listening history is compressed into an embedding vector representing their position in "taste space." Users close together in that space receive similar recommendations. The clusters emerge from the data; no one decided in advance what the groups should be. Spotify Engineering — Data Science →
Topic discovery in customer feedback. Rather than tagging support tickets by hand into predefined categories, companies run topic modeling (a form of unsupervised learning) on tens of thousands of tickets to surface recurring themes. The algorithm finds that "charger," "battery," and "won't turn on" cluster together before any human labels them as "power issues."
From class — the chatbot intents. The return chatbot was built with six defined intent categories. Clustering 20,000 real conversations revealed five unanticipated categories: contractor bulk returns, gift returns, partial returns, return modifications, and compensation requests. The supervised classifier was silently mishandling all five — because the label set reflected what the design team anticipated, not what customers actually did.
No labels, no fixed dataset. An agent takes actions in an environment, receives a reward signal (positive or negative), and learns a policy — a mapping from situations to actions that maximizes cumulative reward over time. The agent is not told the right action in advance; it discovers it through trial and error.
The central tradeoff is explore vs exploit: should the agent take the action that has worked best so far (exploit), or try something new that might work better (explore)? Too much exploitation means the agent gets stuck in a local optimum. Too much exploration means it never capitalizes on what it has learned.
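The tradeoff is easiest to see in an epsilon-greedy bandit, a deliberately simplified form of RL. The offers and acceptance rates below are invented:

```python
import random

random.seed(0)
true_rates = {"offer_a": 0.05, "offer_b": 0.12, "offer_c": 0.08}  # invented; hidden from the agent
counts = {k: 0 for k in true_rates}
wins = {k: 0 for k in true_rates}
EPSILON = 0.1  # fraction of interactions spent exploring

def best_so_far():
    # Untried arms score 1.0, which forces each to be tried at least once.
    return max(counts, key=lambda k: wins[k] / counts[k] if counts[k] else 1.0)

for _ in range(10_000):
    offer = random.choice(list(true_rates)) if random.random() < EPSILON else best_so_far()
    counts[offer] += 1
    wins[offer] += random.random() < true_rates[offer]   # reward: accepted or not

print(best_so_far())  # almost always 'offer_b', discovered by trial and error
```

Set EPSILON to 0 and the agent can lock onto whichever offer happened to win early; set it near 1 and it never cashes in on what it has learned.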
Starbucks Deep Brew. Starbucks uses reinforcement learning inside its mobile app to personalize drink and food recommendations for 16 million Rewards members. The agent learns which suggestions each customer is most likely to accept — based on order history, time of day, weather, and local store inventory — and updates its policy continuously from real purchase feedback. If a customer consistently orders dairy-free, the system learns to stop recommending anything with dairy without being explicitly programmed to do so. Microsoft Source — Starbucks Deep Brew →
Spotify personalization. Spotify uses reinforcement learning to optimize what it surfaces to each listener. The system learns a policy — which tracks, playlists, or podcasts to recommend in which order — from reward signals like whether a user plays, skips, or saves content. Rather than being told in advance what "good" looks like, the agent discovers it through continuous interaction with millions of listeners. Spotify Engineering — ML for personalization →
Real-time ad bidding. In programmatic advertising, the system must decide in milliseconds how much to bid for each impression. RL learns a bidding policy from the reward signal of conversions — bid too high and you overspend; bid too low and you lose the impression. The policy balances cost and outcome continuously across millions of auctions per day.
Reward hacking — promotional discounts. A retailer trains an RL agent to maximize short-term conversion rate. The agent learns that offering steep discounts on every interaction drives purchases — which is true. But customers learn to wait for discounts and stop buying at full price. The system optimized exactly what it was told to optimize. Long-term margin and brand equity were not in the reward function. This is reward hacking: technically correct behavior that violates the actual business intent.
From class — the cat feeder. Rather than fixed feeding times, an RL feeder learns Maurice's actual hunger rhythm. It tries dispensing at a new time (explore), observes whether Maurice eats (reward), and updates its policy. Over time it learns that Maurice is hungry at 6:47am, not 7:00am — something no rule could have specified in advance.
→ Go deeper: Sutton & Barto — Reinforcement Learning: An Introduction, Chapter 1 (free online) — written for a general audience; the explore/exploit framing is explained clearly with no math required in the first chapter.
| Topic | Resource | Format |
|---|---|---|
| AI definition & framework | HBR — AI for the Real World | Article · 15 min |
| Philosophy of AI | IEEE Spectrum — The Turbulent Past and Uncertain Future of AI | Article · 10 min |
| Philosophy of AI | Turing (1950), Computing Machinery and Intelligence | Classic paper |
| Rule-based vs learning | HBR — Simple Economics of Machine Intelligence | Article · 10 min |
| Neural networks | 3Blue1Brown — Neural Networks | Video · 20 min |
| Data annotation & bias | MIT Tech Review — Humans Behind AI's Data | Article · 12 min |
| Overfitting / generalization | Google ML Crash Course — Overfitting | Interactive · 15 min |
| Supervised learning | 3Blue1Brown — Neural Networks | Video · 20 min |
| Supervised learning | Netflix Tech Blog — Beyond the 5 Stars (Part 2) | Article |
| Unsupervised learning | Spotify Engineering — Data Science | Article |
| Reinforcement learning | Spotify Engineering — ML for personalization | Article |
| Reinforcement learning | Microsoft Source — Starbucks Deep Brew | Article |
| Reinforcement learning | Sutton & Barto — RL: An Introduction, Ch. 1 | Free online |