AI & Marketing
HubSpot & Motion AI: Chatbot-Enabled CRM (HBS #518-067)
Paper and pencil. Closed notes. 5 minutes.
Quizzes will be collected before discussion begins.
No makeups given.
Lowest quiz grade across all three cases is dropped.
The situation — September 2017
HubSpot acquires Motion AI.
Motion AI has built 80,000 bots for brands including T-Mobile, Kia, and Sony.
HubSpot’s own chat reps currently handle lead qualification and funnel conversion for its B2B sales process.
The question on the table:
Replace them — or not?
Pick a side.
Not “it depends.” Not “both.”
Yes — replace human chat reps with chatbots.
No — keep humans in the process.
We will add nuance in a moment. First, give me a position.
The numbers
It takes 2,500 visitors to yield 1 customer.
Where would you put a bot?
Where would you never put one?
Walk me through your logic at each stage.
The numbers
It takes 2,500 visitors to yield 1 customer.
| Stage | Bot vs. human |
|---|---|
| 🔝 ToFu — Attract & Identify | Bot advantage. Volume too high for humans. Customers not ready for salespeople. 24/7 availability can raise the 4% self-identification rate. |
| 〰️ MoFu — Educate & Nurture | Mixed. Bots handle simple FAQs; handoff required as questions grow in complexity and specificity. |
| 🔻 BoFu — Qualify & Close | Human advantage. 45-day consultative cycle. $2,400/month product. Bots do triage only — scheduling, pre-qualification. |
| ✅ Post-sale — Onboard & Retain | Contested. Bots reduce service cost. Humans may reduce churn. SaaS model makes getting this wrong expensive. |
Cost to acquire 1 customer — with human chat reps
| Line item | Calculation | Cost |
|---|---|---|
| Lead generation | $50/lead × 100 leads | $5,000 |
| Salesperson | $120K ÷ 12 months ÷ 5 customers/month | $2,000 |
| Chat rep | $60K ÷ 12 ÷ 5 × 3 reps per salesperson | $3,000 |
| Total | | $10,000 |
100 leads → 40 qualified → 5 demos → 1 customer
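The full chain back to the 2,500-visitor figure (the 4% self-identification rate comes from the funnel table earlier; the remaining stage rates are implied by these counts rather than stated separately):

\[2{,}500 \times 4\% = 100 \;\rightarrow\; 100 \times 40\% = 40 \;\rightarrow\; 40 \times 12.5\% = 5 \;\rightarrow\; 5 \times 20\% = 1\]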
Cost to acquire 1 customer — with AI chatbot
| Line item | Calculation | Cost |
|---|---|---|
| Lead generation | $50/lead × 100 leads | $5,000 |
| Salesperson | Same as above | $2,000 |
| Chatbot | One-time fixed cost | $0 |
| Total | | $7,000 |
Savings: $3,000 per customer acquired
Reinvested at $50/lead → 60 more leads → 0.48 incremental customers
Net result: 1.28 customers for the same spend
Despite a 20% drop in conversion rate (10% vs. 12.5%), bots produce a 28% gain in customers per dollar.
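A minimal sketch reproducing the two cost stacks and the reinvestment math. All dollar figures come from the slides; the per-lead conversion rates (1.0% human, 0.8% bot) are implied by "100 leads → 1 customer" and "60 leads → 0.48 customers", not stated directly:

```python
# Reproduce the acquisition economics from the two tables above.
LEAD_COST = 50
SALESPERSON = 120_000 / 12 / 5          # $2,000 per customer acquired
CHAT_REPS = 60_000 / 12 / 5 * 3         # $3,000: 3 chat reps per salesperson

cac_human = 100 * LEAD_COST + SALESPERSON + CHAT_REPS   # $10,000
cac_bot = 100 * LEAD_COST + SALESPERSON                 # $7,000 (bot: ~$0 marginal cost)

# Hold total spend fixed at $10,000 and redeploy the $3,000 saving into leads.
extra_leads = (cac_human - cac_bot) / LEAD_COST         # 60 more leads
customers_human = 100 * 0.010                           # 1.00 customer
customers_bot = (100 + extra_leads) * 0.008             # 1.28 customers
print(customers_bot / customers_human - 1)              # 0.28 -> +28% per dollar
```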
The formula
\[CLV = m \times \frac{r}{1-r} - AC\]
(simplified — zero discount rate)
Acquisition costs from the previous slide:
Humans: \(AC = \$10{,}000\)
Bots: \(AC = \$7{,}000\)
Assumptions
| Parameter | Value | Source |
|---|---|---|
| Avg. annual revenue / customer | $11,660 | $271M ÷ 23,226 customers (Exhibit 1) |
| Gross margin | 74% | B2B SaaS median benchmark |
| Annual profit / customer (m) | $8,628 | $11,660 × 0.74 |
| Annual churn rate | 10% | B2B SaaS SMB benchmark |
| Retention rate (r) | 0.90 | 1 − 0.10 |
| Discount rate (d) | 0 | Simplified |
Calculate CLV for both scenarios: humans and bots.
CLV answers from the previous slide
| | Humans | Bots |
|---|---|---|
| \(m\) (annual profit/customer) | $8,628 | $8,628 |
| \(r/(1-r)\) (lifetime multiplier) | 9.0 | 9.0 |
| \(m \times r/(1-r)\) | $77,652 | $77,652 |
| Acquisition cost (AC) | $10,000 | $7,000 |
| CLV | $67,652 | $70,652 |
The $3,000 difference is entirely the acquisition cost saving. Under identical assumptions, bots always win by exactly \(AC_\text{humans} - AC_\text{bots}\).
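The table as a two-line check, using the zero-discount-rate formula and the rounded values from the assumptions slide:

```python
# Zero-discount-rate CLV, as on the formula slide: CLV = m * r/(1-r) - AC
def clv(m, r, ac):
    return m * r / (1 - r) - ac

m, r = 8_628, 0.90               # annual profit per customer, retention rate
print(clv(m, r, ac=10_000))      # humans: 67652.0
print(clv(m, r, ac=7_000))       # bots:   70652.0
```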
Now assume bot-acquired customers have weaker relationships.
Q1 — Retention breakeven
How much would annual churn need to rise — for bot-acquired customers only — before bot CLV equals human CLV?
Hint: set bot CLV = $67,652 (human CLV) and solve for \(r^*\)
Q2 — Margin breakeven
How much would annual profit per customer need to fall — for bot-acquired customers only — before bot CLV equals human CLV?
Hint: set bot CLV = $67,652 and solve for \(m - \Delta m\)
Q1 — Retention breakeven
Set bot CLV = human CLV = $67,652 and solve for \(r^*\):
\[m \times \frac{r^*}{1-r^*} - AC_\text{bots} = \$67{,}652\]
\[8{,}628 \times \frac{r^*}{1-r^*} - 7{,}000 = 67{,}652\]
\[8{,}628 \times \frac{r^*}{1-r^*} = 74{,}652\]
\[\frac{r^*}{1-r^*} = 8.652 \quad \Rightarrow \quad r^* = 0.896\]
\[\text{Churn} = 1 - 0.896 = \mathbf{10.4\%}\]
Churn only needs to rise 0.4 percentage points — from 10% to 10.4% — before the bot advantage disappears entirely.
Q2 — Margin breakeven
Set bot CLV = human CLV = $67,652 and solve for \(\Delta m\):
\[(m - \Delta m) \times \frac{r}{1-r} - AC_\text{bots} = \$67{,}652\]
\[(8{,}628 - \Delta m) \times 9.0 - 7{,}000 = 67{,}652\]
\[(8{,}628 - \Delta m) \times 9.0 = 74{,}652\]
\[8{,}628 - \Delta m = 8{,}295\]
\[\Delta m = \mathbf{\$333/\text{year}} \quad (3.9\%)\]
Annual profit only needs to fall $333 per customer per year — just 3.9% — before bots stop being worth it.
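Both breakevens in a few lines of Python, following the algebra above:

```python
# How far can retention or margin slip before bot CLV falls back to $67,652?
m, ac_bot, target = 8_628, 7_000, 67_652

# Q1: solve m * r/(1-r) - ac_bot = target for r.
mult = (target + ac_bot) / m          # required lifetime multiplier: 8.652
r_star = mult / (1 + mult)            # 0.896
print(1 - r_star)                     # breakeven churn ~0.104, vs 0.10 baseline

# Q2: solve (m - dm) * 9 - ac_bot = target for dm.
dm = m - (target + ac_bot) / 9        # ~$333 per year
print(dm, dm / m)                     # ~3.9% of annual profit per customer
```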
The insight
Both breakevens are very small. The $3,000 acquisition saving is fragile on both dimensions.
The risk is not that bot customers leave sooner. It is that they spend less and cost more to serve while they stay.
Three design decisions every company faces
1. Disclose or conceal?
Should customers know they are talking to a bot?
2. Brand voice or customer mirroring?
Should the bot speak in a consistent brand voice — or dynamically adjust its tone to match the customer?
3. Functional UI or conversational UI?
Get things done efficiently — or build a relationship through natural dialogue?
On disclosure — the uncanny valley
People prefer human-like bots — up to a point. When a bot that seems human suddenly fails, the reaction shifts from engagement to revulsion.
“The more human-like a system acts, the broader the expectations — and the broader the disappointments.”
In 2017, bots failed 70% of the time and could handle less than 20% of an interaction before handoff.
High human-likeness + high failure rate = the worst possible combination for trust.
Open question: Does disclosure reduce satisfaction — or does it protect it by calibrating expectations?
Three design decisions every company faces
1. Disclose or conceal?
Should customers know they are talking to a bot?
2. Brand voice or customer mirroring?
Should the bot speak in a consistent brand voice — or dynamically adjust its tone to match the customer?
3. Functional UI or conversational UI?
Get things done efficiently — or build a relationship through natural dialogue?
On voice — a genuine tradeoff
Humans naturally mirror their conversational partners — it is a foundation of relationship building.
But mirroring a frustrated customer’s frustration back at them may amplify the problem.
LLMs can detect sentiment in real time and adjust tone dynamically. The capability now exists.
Open question: Does a bot that adapts its tone feel more relational — or more manipulative?
Three design decisions every company faces
1. Disclose or conceal?
Should customers know they are talking to a bot?
2. Brand voice or customer mirroring?
Should the bot speak in a consistent brand voice — or dynamically adjust its tone to match the customer?
3. Functional UI or conversational UI?
Get things done efficiently — or build a relationship through natural dialogue?
On UI — speed vs. warmth
B2B buyers are busy. They often just want the answer. But a purely functional UI is essentially a phone tree with a chat interface.
The key finding: customers want outcome speed, not conversation quality.
A bot that resolves in 47 seconds outperforms a bot that has a warm conversation for 3 minutes and fails.
Open question: Does conversational warmth improve outcomes — or does it just slow resolution down?
These are empirical questions. Your projects test them.
What Motion AI’s bots actually were
Rule-based, with limited ML.
A human coder scripted the decision tree. The bot followed it. When customer input did not match an anticipated path — it failed.
This is exactly the rule-based AI from Day 1.
Customer: "Any way to get a discount?"
Bot: I didn't understand that.
Press 1 for pricing
Press 2 for a demo
Press 3 for a rep
The problem was not the rules.
The problem was that language does not follow rules.
Infinite variation. Ambiguity. Context. Sarcasm. Slang.
No rule set can enumerate all the ways a customer can ask about a discount.
What changed
The chatbot in HubSpot’s 2024 product is not a decision tree.
It is a large language model — trained on billions of documents, capable of responding fluently to inputs no human ever explicitly scripted.
Why does it work where the 2017 bot failed?
The answer is not that someone wrote better rules.
The answer is that someone built a completely different kind of system — one that does not start with rules at all.
We are going to open that system and look inside.
Same framework as Day 1. Same six concepts. One new running example throughout.
The same six concepts from Day 1
On Day 1 we built a framework for understanding any AI system. We used two examples — the cat feeder and the Home Depot return chatbot — to trace how each concept maps onto a real deployed system.
Today we do the same thing for a large language model.
Same framework. New system. One running example throughout.
By the end of this section you will be able to trace exactly what happens to this sentence — from the moment it arrives to the moment the bot responds.
Our running example
A prospect messages HubSpot’s chatbot:
“Is there any way to get a discount before I commit to the annual plan?”
| Day 1 concept | This system |
|---|---|
| Perception | ? |
| Representation | ? |
| Model | ? |
| Constraints | ? |
| Algorithm | ? |
| Action | ? |
The model does not read words.
It reads tokens — integer IDs representing chunks of text.
“Is there any way to get a discount before I commit to the annual plan?”
16 tokens · 70 characters
Each word or word-fragment maps to one integer in the model’s vocabulary. The model never sees letters.
316 appears twice — once in “to get” and once in “to the.” Same ID. Completely different meaning. The model resolves that from surrounding tokens.
Framework — Perception row filled:
| Day 1 concept | This system |
|---|---|
| Perception | Token IDs from the input message |
| Representation | ? |
| Model | ? |
| Constraints | ? |
| Algorithm | ? |
| Action | ? |
The model does not read words.
It reads tokens — integer IDs representing chunks of text.
“Is there any way to get a discount before I commit to the annual plan?”
16 tokens · 70 characters
[3031, 1354, 1062, 2006, 316, 717, 261, 11522, 2254, 357, 8737, 316, 290, 12355, 3496, 30]
The model never sees the word “discount.” It sees 11522.
Every operation from here on is linear algebra: the integer IDs are immediately converted to vectors, and everything downstream operates on those.
Framework — Perception row filled:
| Day 1 concept | This system |
|---|---|
| Perception | Token IDs from the input message |
| Representation | ? |
| Model | ? |
| Constraints | ? |
| Algorithm | ? |
| Action | ? |
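Tokenization is easy to see for yourself. A minimal sketch using OpenAI’s open-source tiktoken library; the exact IDs depend on which encoding the deployed model uses, so treat the output as illustrative rather than a reproduction of the IDs on the slide:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models
msg = "Is there any way to get a discount before I commit to the annual plan?"

ids = enc.encode(msg)
print(len(ids), ids)                       # a list of integer token IDs
print([enc.decode([i]) for i in ids])      # the text chunk behind each ID
```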
This distinction is almost always skipped. It causes more confusion than anything else.
The word “representation” appears twice in how LLMs work — in two completely different roles.
Role 1 — Training representation
During training, billions of sentences from the internet are processed. Text like:
“Customers who ask about discounts before committing to an annual plan are often in the final evaluation stage.”
This is what the model learns from. It builds the model’s knowledge of language, context, and meaning.
During training, the embeddings are being updated. The weights are changing with every correction.
Role 2 — Input representation (Perception)
During inference, the specific message arrives:
“Is there any way to get a discount before I commit to the annual plan?”
This is what the model is perceiving right now. It uses the knowledge built during training to interpret this input.
During inference, the embeddings are fixed. The weights are frozen.
The same mechanism, two different roles
| | Training | Inference |
|---|---|---|
| What | Billions of sentences | Your specific message |
| Role | Build the model | Use the model |
| Weights | Updating | Frozen |
| Day 1 term | Training phase | Prediction phase |
Why this matters for managers
The model does not know HubSpot’s specific discount policy. But it has seen millions of sentences about discounts and annual plans. It learned the statistical patterns of how those conversations unfold.
That is very different from knowing the actual policy.
Each token ID is converted to a vector.
A vector is a list of numbers — a very long one.
For text-embedding-3-small, each token becomes 1,536 numbers.
Here is what the embedding for “Limited-time offer on unbelievably good deals!” actually looks like:
"embedding": [
-0.02082209,
-0.0050799586,
-0.058835678,
0.017880306,
-0.02006153,
0.027121812,
...
-0.0008462113,
0.029016035,
-0.0007385851,
0.06859379,
0.0150533235,
-0.009506985,
0.00739751,
0.018597813
]
1,536 values. Every token. Not assigned by a human. Learned from billions of prediction-correction cycles.
What do these numbers mean?
Each number is a coordinate in a 1,536-dimensional space. The position encodes meaning — learned by predicting text billions of times, not defined by a human.
For our discount question:
- “discount” lands near “promo,” “coupon,” “offer,” “deal”
- “commit” lands near “buy,” “subscribe,” “purchase”
- “annual plan” lands near “subscription,” “contract”
- “Any discount available?”, “any promos?”, and “is there a deal?” all land in the same neighborhood.
The 2017 bot required a human to script each phrasing. The LLM learned the equivalence from data.
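Those neighborhoods can be checked directly against the embeddings endpoint named above. A minimal sketch; the OpenAI client call is real, the phrasings are the slide’s examples plus one unrelated control:

```python
# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

phrases = ["Any discount available?", "any promos?", "is there a deal?",
           "what time is it?"]  # control: should land far away
resp = client.embeddings.create(model="text-embedding-3-small", input=phrases)
vecs = np.array([d.embedding for d in resp.data])   # shape (4, 1536)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i in range(1, len(phrases)):
    print(phrases[i], round(float(cosine(vecs[0], vecs[i])), 3))
# The discount phrasings score high with each other; the control scores low.
```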
Framework — Representation row filled:
| Day 1 concept | This system |
|---|---|
| Perception | Token IDs |
| Representation | Learned embedding vectors |
| Model / Constraints / Action | next slides |
A bag of vectors is not a sentence.
Each token has an embedding. But “not good” means something completely different from “good” — even though the individual embeddings for “not” and “good” are the same in both cases.
Self-attention is the mechanism that makes context matter.
For each token, the model computes: how much should every other token in this sentence influence my meaning?
For “discount” in our sentence:
| Attends to | Weight |
|---|---|
| “annual plan” | 0.88 |
| “commit” | 0.72 |
| “any way” | 0.52 |
| “before” | 0.36 |
| “Is” | 0.10 |
“Discount” in the context of “annual plan commitment” means something specific — pre-purchase pricing inquiry.
Without attention: “discount” is just a vector. With attention: “discount in an annual plan commitment context” is a richer, context-aware vector.
Why this matters for marketing language
“The plan is not discounted.” vs. “Is the plan discounted?”
Same tokens. Opposite meanings. Self-attention captures the difference because “not” attends strongly to “discounted,” inverting the semantic direction.
Tone, urgency, frustration, sarcasm — all live in the attention relationships between tokens.
“I guess I’ll just go with the monthly plan then.”
“Guess” and “just” signal reluctance and implicit downgrade intent. A model with strong self-attention reads this as a retention risk.
The 2017 bot had no equivalent mechanism.
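A minimal sketch of single-head scaled dot-product attention in numpy. The projection matrices are random stand-ins; in a trained model they are learned, and the attention pattern on the slide is what the trained versions produce:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 16, 64                   # 16 tokens, toy embedding width
X = rng.normal(size=(n_tokens, d))     # token embeddings (learned, in reality)

# Learned projection matrices (random here, trained in a real model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)                  # token-to-token influence scores
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax

context_aware = weights @ V            # each row: context-weighted mix of all tokens
print(weights[7].round(2))             # e.g. the row for "discount" (8th token)
```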
The same six concepts. A new system.
The steppers just showed you all six rows — in order.
The key insight from training:
The model was never told what “discount” means or that “commit to annual plan” signals buying intent. All of it fell out of predicting text billions of times.
The algorithm forced the knowledge. The representation encoded it geometrically. The model locked it in.
The raw LLM has no guardrails.
The context window limits how much the model can see at once. That is the only structural constraint the raw LLM has.
Nothing prevents it from inventing a discount code, quoting outdated pricing, or making promises that contradict policy.
Guardrails do not exist at the LLM level. They are a system-level addition. Which is exactly what we need to build next.
The completed framework — raw LLM
| Day 1 concept | Raw LLM |
|---|---|
| Perception | Token IDs from the input message |
| Representation | Learned embedding vectors |
| Model | The system logic: perceive → embed → predict → generate |
| Constraints | Context window only — no guardrails |
| Algorithm | Next-token prediction + backpropagation |
| Action | Generated text — one token at a time |
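The Action row, “one token at a time,” is literally a loop around a probability distribution. A minimal temperature-sampling sketch; the vocabulary and logits are hypothetical, since in a real model the logits come out of the network at each step:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["Yes", " we", " currently", " offer", "..."]   # toy 5-token vocabulary

def sample_next(logits, temperature=0.8):
    """Turn raw scores into a probability distribution, then sample one token."""
    z = np.array(logits) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax
    return rng.choice(len(p), p=p)

logits = [3.1, 1.2, 0.4, 0.2, -1.0]    # hypothetical model output for one step
print(vocab[sample_next(logits)])      # usually "Yes"; lower temperature -> more so
```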
Five things the raw LLM cannot do:
| Limitation | Business consequence |
|---|---|
| Knowledge cutoff | Quotes expired promotions or wrong pricing |
| No memory | Prospect repeats themselves every session |
| Cannot act | “I’ve flagged this” — but nothing was logged |
| Hallucinates | Prospect expects a discount that does not exist |
| No private data | Cannot personalize to this prospect |
The solution is not a smarter LLM. It is a better system around it.
The prospect’s question
“What’s the final price after the 20% discount, with sales tax? I’m in DC.”
What a raw LLM generates:
“With a 20% discount the Professional plan would be $640/month. DC sales tax is 6%, bringing it to $678.40.”
Why this is a problem: the figures came from next-token prediction, not calculation. They happen to be right here, but nothing verified them, and the model would state a wrong number with the same fluent confidence.
The solution: give the system a calculator.
Steps 1–3
1. Perception: Message + tool definition arrive. calculator(expression: string) → float
2. Representation: Context window assembled.
3. Model — LLM pass 1: Predicts a tool call:
{"tool": "calculator",
"input": "800 * 0.80 * 1.06"}
Steps 4–6
4. Algorithm: Calculator runs: 800 × 0.80 × 1.06 = 678.4
5. Representation updated: [TOOL RESULT] calculator → 678.4
6. Model — LLM pass 2:
“With the 20% discount and DC’s 6% SaaS tax, your monthly cost would be $678.40.”
Correct because the calculator verified it.
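The two-pass structure in steps 1–6 is a plain loop. A minimal sketch, with a hypothetical `call_llm` standing in for the model and the calculator as the one registered tool:

```python
import json

def calculator(expression: str) -> float:
    # Real systems use a safe math parser, not eval; this is a sketch.
    return eval(expression, {"__builtins__": {}})

TOOLS = {"calculator": calculator}

def answer(message: str, call_llm) -> str:
    """Two-pass tool loop: the LLM may emit a tool call; we run it and re-prompt."""
    context = [message]
    reply = call_llm(context)                    # pass 1
    try:
        call = json.loads(reply)                 # e.g. {"tool": "calculator", ...}
        result = TOOLS[call["tool"]](call["input"])
        context.append(f"[TOOL RESULT] {call['tool']} -> {result}")
        reply = call_llm(context)                # pass 2, grounded in the result
    except (json.JSONDecodeError, KeyError, TypeError):
        pass                                     # no tool call: reply is final
    return reply
```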
The simple system solved arithmetic.
It did not solve the actual business problem.
The prospect asked:
“Is there any way to get a discount before I commit to the annual plan?”
What the simple system still cannot answer:
- Is there an active promotion this prospect qualifies for?
- What is the current price of the plan they are evaluating?
- What is the bot allowed to offer on its own authority?
- Has a sales rep already engaged this prospect?
None of these can be answered by a calculator.
They require retrieval, memory, and access to private data.
What needs to be added
Retrieval (RAG): before the LLM generates anything, the system queries:
- Promotions database → is ANNUAL20 active?
- Pricing documentation → what are the current tiers?
- Policy document → what discounts can be self-served?
Memory: the prospect’s conversation history, previous interactions, and account status persist across sessions.
CRM integration: this prospect’s lead score, the plan they are evaluating, and whether a sales rep has already engaged.
Guardrails: policy rules enforced before any response is sent. No discount above X% without approval; always escalate if the request exceeds bot authority.
Tools (beyond the calculator): apply a promotion code, log a CRM note, schedule a sales call, send a follow-up email.
The simple system had one tool. The full system has a toolkit.
Same question. Full system.
“Is there any way to get a discount before I commit to the annual plan?”
Step 1 — Retrieval (before LLM)
System detects discount + annual plan intent. Queries three sources:
Promotions DB → ANNUAL20: 20% off,
new customers, expires Apr 30
Pricing docs → Professional: $800/month
Policy doc → Up to 20% self-serve;
above 20% needs approval
CRM record → Lead score: 82
Plan: Professional
Status: new customer
Step 2 — Context injection
All retrieved data injected into context window.
LLM now perceives: question + actual policy + active promotion + prospect’s account status.
Step 3 — LLM generates (grounded)
“Yes — we have a current promotion for new customers on annual billing: 20% off your first year. For the Professional plan, that’s $640/month — or $7,680/year. Would you like me to apply that to your account?”
Not because the LLM knew any of this. Because the system retrieved it and put it in context.
Step 4 — Tool calls (if prospect says yes)
apply_promotion(account_id, "ANNUAL20")
log_crm_note(lead_id, "Discount applied — 20%")
Guardrail check: 20% within self-serve authority. ✓
What changed from the raw LLM:
The LLM generated the words. The system provided the facts, enforced the policy, and took the action.
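Steps 1–4 as a pipeline sketch. The data stores and the `call_llm` interface are hypothetical stand-ins; the shape (retrieve, inject, generate, check, act) is the point:

```python
# Hypothetical stand-ins for the real data sources (sketch only).
PROMOTIONS = {"ANNUAL20": {"discount": 0.20, "audience": "new customers"}}
PRICING = {"Professional": 800}                 # $/month
POLICY = {"self_serve_cap": 0.20}
CRM = {"lead-123": {"score": 82, "plan": "Professional", "status": "new customer"}}

def handle(message, prospect_id, call_llm):
    # Step 1 - retrieve BEFORE the LLM generates anything.
    facts = {"promotions": PROMOTIONS, "pricing": PRICING,
             "policy": POLICY, "crm": CRM[prospect_id]}
    # Step 2 - inject the retrieved data into the context window.
    context = f"{facts}\n\nProspect: {message}"
    # Step 3 - the LLM generates, grounded in what was injected.
    draft, proposed_discount = call_llm(context)   # hypothetical interface
    # Step 4 - guardrail: enforce policy before anything leaves the system.
    if proposed_discount > POLICY["self_serve_cap"]:
        return "Escalated to a sales rep."         # exceeds bot authority
    return draft
```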
Raw LLM — what it could do alone
| Day 1 concept | Raw LLM |
|---|---|
| Perception | Token IDs from input only |
| Representation | Learned embeddings |
| Model | Perceive → embed → predict → generate |
| Constraints | Context window only |
| Algorithm | Next-token prediction |
| Action | Generated text only |
Generates plausible text. Cannot verify it. Cannot act. Cannot remember.
Full system — what the architecture adds
| Day 1 concept | Full GenAI system |
|---|---|
| Perception | Input + retrieved context + memory |
| Representation | Embeddings + structured retrieved facts |
| Model | Extended flowchart: retrieve → LLM → tools → guardrail → output |
| Constraints | Context window + policy rules + tool interception |
| Algorithm | Next-token prediction + semantic search + tool execution |
| Action | Text + CRM updates + emails + escalations |
Generates grounded, policy-consistent responses. Remembers across sessions. Takes real-world actions.
The HubSpot case covered one narrow moment in customer acquisition: what happens after someone becomes a lead.
The chatbot qualified interest. The LLM drafted responses. The system applied a discount code.
But before any of that happened, something had to bring that prospect to HubSpot’s website in the first place.
That is customer acquisition.
And AI enters it at every stage — not just at the lead qualification step.
Where we were in the funnel
Awareness
│
Intent
│
Conversion ← HubSpot chatbot lives here
│
Retention
The question is not what HubSpot should do with its chatbot. The question is how AI enters the full acquisition system.
Where are CMOs actually investing in AI?
Four stages. Different objectives at each.
Awareness The prospect does not know you exist yet. Goal: get seen by the right people.
Intent The prospect is actively looking for a solution. Goal: be findable when they search.
Conversion The prospect is evaluating options. Goal: remove friction and close.
Retention (covered in a later session) The customer has purchased. Goal: reduce churn, expand revenue.
Today: awareness, intent, and conversion.
What marketers actually do at each stage
| Stage | Activities |
|---|---|
| Awareness | Display · paid social · video · sponsorships · PR |
| Intent | Paid search · SEO · comparison content · email capture |
| Conversion | Landing pages · retargeting · lead scoring · chat · nurture |
| Retention | (Week 5 — PittaRosso case) |
The funnel is not abstract. It is a set of tasks, each with its own data, budget, and performance metric.
AI does not enter “marketing.” It enters specific tasks at specific stages — with different data requirements, different risks, and different economics at each.
The Container Store — cookieless lookalike
Type: Supervised learning
Problem: iOS14 and cookie deprecation eliminated third-party signals. Standard lookalike audiences stopped working.
Approach: LiveRamp’s identity graph matched first-party customer data to behavioral signals across the open web, without third-party cookies.
Result:
Source: Total Retail / LiveRamp, 2024
What the model does: Learns the profile of past purchasers. Finds new users who match that profile. Shows them the ad.
Problem 1 — The seed encodes the past
The model’s training data is existing customers. That is who it learns to find more of.
A 25-year-old who just moved into their first apartment has never visited The Container Store. They need home organization products. The lookalike model cannot find them — they look nothing like anyone in the seed.
The model reproduces the existing customer base. It does not find the customers you haven’t reached yet.
If your customer base skews toward a particular demographic or geography, the model concentrates spend there, not by design, but because the seed never included anyone else.
Problem 2 — Incrementality
The model finds likely buyers — not incremental buyers.
Without a holdout test (a control group who did not see the ad) there is no way to know how many of those 37% would have purchased anyway.
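What a holdout test measures, in four lines. Every number here is hypothetical; none are from the case:

```python
# Hypothetical holdout arithmetic: the figures are illustrative only.
exposed, exposed_buyers = 100_000, 370       # saw the ad
holdout, holdout_buyers = 100_000, 290       # randomly withheld from the ad

lift = exposed_buyers / exposed - holdout_buyers / holdout
incremental = lift * exposed                 # buyers the ad actually created
print(incremental)                           # 80 of the 370 "conversions" are real
# Without the holdout row, all 370 would be credited to the campaign.
```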
Supervised learning and the Day 1 conditions:
| Condition | Container Store |
|---|---|
| Volume high? | ✓ Millions of impressions |
| Signal measurable? | ✓ Purchases tracked |
| Task well-defined? | ✓ Find likely converters |
| Training data representative? | ✗ Seed = past customers only |
The incrementality gap is what the Artea case forces you to measure next week.
Haleon / Panadol — DCO for pain relief (2023)
Type: Reinforcement learning (multi-armed bandit)
Context: Panadol (owned by Haleon) wanted to reposition from “headache tablet” to solution for all pain types. Hong Kong market.
Stack: Zenith (strategy) · Innovid (DCO) · The Trade Desk (targeting)
Approach: 5 base creative templates. Innovid’s engine generated 600+ ad versions matched to pain type, daily moment, and audience.
Example: Office worker + back pain → image of man at desk + Panadol Joint Extend.
Result:
The bandit learning loop
Step 1 — Segment
Audience split by pain type and context:
back pain · headache · fever · joint pain
Step 2 — Select
Choose the creative combination with the
highest predicted completion rate for
this segment. Initially: explore variants.
Over time: exploit what works.
audience: office worker, back pain
image: man at desk, hands on back
product: Panadol Joint Extend
message: "Back to work, not to pain"
Step 3 — Serve
Ad rendered in real time.
600+ variants from 5 templates.
Step 4 — Update
Completion rate observed.
Winning combinations earn more impressions.
Losing combinations fade out.
This is a multi-armed bandit: a simplified form of RL where the agent learns which arm (creative variant) to pull more often, based on observed reward. No labeled training data required.
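A minimal epsilon-greedy sketch of the select-serve-update loop. Epsilon-greedy is one simple bandit policy (the case does not say which algorithm Innovid uses), and the variants and completion rates are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_RATES = [0.10, 0.22, 0.15]   # hypothetical completion rates per creative variant
counts = np.zeros(3)              # impressions served per variant
wins = np.zeros(3)                # completions observed per variant

for impression in range(10_000):
    if rng.random() < 0.1:                    # explore 10% of the time
        arm = rng.integers(3)
    else:                                     # exploit the current best estimate
        arm = np.argmax(wins / np.maximum(counts, 1))
    counts[arm] += 1
    wins[arm] += rng.random() < TRUE_RATES[arm]   # observed reward: completed or not

print(counts)   # the winning variant (index 1) earns most of the impressions
```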
What makes DCO different from generative AI
DCO selects and assembles from human-designed components. Generative AI creates new content.
| | DCO | Generative AI |
|---|---|---|
| Source | Human templates | Model output |
| Brand control | High | Requires guardrails |
| Scale | High | Very high |
| Accuracy risk | Low | Higher |
The constraint that makes DCO work:
Human creative teams built the 5 base templates. Brand safety, product accuracy, and regulatory compliance live in the template design. The model cannot change them.
Where DCO goes wrong:
The model optimizes completion rate — not brand perception, purchase intent, or long-term equity.
A version that drives clicks may not drive the right association. The metric is a proxy. Treat it as one.
JPMorgan Chase — AI-generated copy (Persado)
Type: Generative AI with explicit constraint layer
Problem: Testing ad copy variants at scale requires writing hundreds of versions manually.
Approach: Persado’s platform generates and tests language variants, predicting emotional response and engagement per phrase × audience combination. Compliance filters and brand voice rules applied before any output is shown.
Result: Up to 450% higher CTR vs. human-written copy in controlled tests.
Source: Persado / JPMorgan Chase
Why it worked:
McDonald’s Netherlands — AI holiday ad (2024)
Depicted festive chaos: cyclists in snow, Santa in traffic, family disaster.
Pulled after launch due to public backlash. Viewers called it “AI slop.”
Coca-Cola — AI holiday ads (2024 and 2025)
AI recreation of the classic 1995 “Holidays Are Coming” ad. Criticized as “soulless” and “creepy.”
Coca-Cola ran a second AI campaign in 2025 despite the 2024 backlash.
Source: Nielsen Norman Group, Dec 2025
NNG’s diagnosis:
“Audiences can perceive when the narrative is shaped around what the technology can do rather than what the story should be.”
The Day 1 conditions applied
| Condition | JPMorgan | McDonald’s |
|---|---|---|
| Task well-defined? | ✓ Headline CTR | ✗ Brand narrative |
| Signal measurable? | ✓ Click rate | ✗ Emotional resonance |
| Feedback fast? | ✓ A/B results | ✗ Brand equity is slow |
| Constraints defined? | ✓ Compliance layer | ✗ Model decides story |
The technology did not fail. The application did.
Criteo — La Redoute dynamic retargeting
Type: Supervised learning + reinforcement learning
La Redoute is a French fashion and home retailer with millions of SKUs. Users browse, do not buy, leave.
Approach: Criteo’s engine tracks product views, predicts purchase probability per user-product combination, and serves dynamic ads across 19,000+ publisher sites, showing the right product at the right moment.
The model also surfaces related products the user had not viewed, predicting adjacent demand.
Result:
Source: Criteo / La Redoute
The model found demand the user did not yet know they had. That is the retargeting opportunity lookalike targeting cannot reach.
Retargeting vs. lookalike: different signal, different risk
| | Lookalike (D1) | Retargeting (D4) |
|---|---|---|
| Stage | Awareness | Conversion |
| Signal | Past purchasers | Current browse behavior |
| Intent window | Weeks | Hours |
| Model knows | Profile similarity | Product interest |
The RL component: timing and offer
Not every retargeted user needs a discount. The model learns: who converts with an incentive, what size, and what delay?
The coupon trap
If the model learns that discounts convert abandoned carts, it serves discounts consistently. Users learn to browse, abandon, and wait.
The model optimized conversion rate and trained customers to expect a discount.
The model optimizes what it measures. The goal was profitable conversion. The metric was conversion rate. Those are not the same thing.
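The coupon trap is a reward-definition bug. A two-function contrast with hypothetical margin and discount values, showing how the objective changes when the discount cost enters the reward:

```python
# Hypothetical per-impression rewards for a retargeting model.
def reward_conversion(converted, discounted):
    return 1.0 if converted else 0.0            # discount cost is invisible

def reward_profit(converted, discounted, margin=20.0, discount_cost=13.0):
    if not converted:
        return 0.0
    return margin - (discount_cost if discounted else 0.0)
# Under the first reward, "send a discount" dominates whenever it lifts conversion
# at all; under the second, the lift must be worth at least the $13 given away.
```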
HubSpot: 2017 vs. 2024
Type: All three: supervised, retrieval, generative + RL escalation policy
| | 2017 Motion AI | 2024 Breeze AI |
|---|---|---|
| Architecture | Rule-based | LLM + retrieval + tools |
| Language handling | Scripted menus | Arbitrary inputs |
| Failure mode | Loud: “I don’t understand” | Quiet: confident wrongness |
| Content generation | Human-authored | Near-zero marginal cost |
| Failure rate | ~70% (Facebook data) | Much lower in narrow domains |
The same decision from Part 1 — should HubSpot replace its chat reps? — now has a different technical answer.
Source: HubSpot Breeze AI
What GenAI fixed
The relational intelligence gap narrowed. The 70% failure rate dropped for narrow tasks. Content generation became nearly free.
What GenAI did not fix
The task-fit condition has not changed.
High-volume, narrow task: GenAI helps. Complex relational judgment, emotionally sensitive, ambiguous goal: human advantage persists.
Three failure modes the 2024 system introduced that the 2017 system did not have
Stale retrieval
Quotes a promotion that expired; no one updated the database.
Constraint gaps
Guardrails only cover cases someone anticipated. Policy gaps are invisible until a customer finds one.
Confident wrongness
The LLM generates fluently whether retrieval succeeded or not. A rule-based system fails loudly. An LLM-based system fails quietly.
Your project tests one design response to these risks. You will answer it with data.
The case for yes
The case for no
The harder question: was HubSpot buying a product — or buying a bet?
The bet: that conversational interfaces would become the primary channel for B2B sales interactions, and that owning that layer early would be worth riding through one technology generation.
There is no clean answer. The case asks you to make the argument.
Due before Week 3 (April 1) — submit via Canvas
Artea ran an A/B test on 5,000 customers. Half received a 20% off coupon; the other half served as the control. You will analyze the results and recommend a targeting policy for the next campaign of 6,000 customers.
The data: two Excel tabs
AB_test — 5,000 customers, acquisition channel, cart status, past behavior, and outcomes (transactions, revenue) one month later.
Next_Campaign — 6,000 new customers, same variables, no outcomes. These are the customers Artea needs to decide whether to target.
Five questions
Q1 — The Experiment (10 pts) Why does this require a randomized control group? What is one important limitation?
Q2 — Overall Effect (10 pts) What did the coupon do? Are you confident enough to act on it?
Q3 — Heterogeneity (25 pts) Does the effect differ by acquisition channel and cart status? Would you build a targeting policy around it?
Q4 — Targeting Policy (50 pts) State the rule · Predict the effect · Break-even ($0.50 cost, 20% off, $65 avg transaction) · Justify vs. send-all or send-none
Q5 — What Your Policy Cannot Tell You (15 pts) What assumption might not hold, and what happens to your predictions if it is wrong?
Part 1 — HubSpot case
Should HubSpot replace its chat reps with bots? You worked through the CLV math, the breakeven, and the three design questions: disclose or conceal, voice, and interface. The economics can work. The design choices determine whether they do.
Part 2 — How LLMs work
Six steps from raw message to generated response: tokenize, embed, attend, distribute, sample, output. The raw LLM has no guardrails, no memory, no tools. Adding retrieval, tool calls, and a guardrail layer is what turns an LLM into a system.
Part 3 — AI across the acquisition funnel
Five decisions. Three ML types. Five examples. The model optimizes what it measures. Defining the right metric and the right constraints is the marketer’s job, not the model’s.
HubSpot Breeze closes the loop: the same system from Part 2, deployed at the conversion stage — with three failure modes the rule-based bot never had.
Part 3 — AI across the acquisition funnel
Five decisions. Three ML types. Five examples.
| Decision | ML type |
|---|---|
| Who sees the ad? | Supervised |
| Which creative to show? | RL (bandit) |
| What creative to make? | Generative |
| Who to retarget? | Supervised + RL |
| How to qualify the lead? | All three |
The model optimizes what it measures. Defining the right metric and the right constraints is the marketer’s job, not the model’s.
Before next class
Read the Artea case (HBS #521-021).
Group Assignment 1 due before class. Submit on Canvas before you walk in.
There is a quiz at the start of class. Paper and pencil, closed notes.
⚠️ Do not be late.