I'm sorry, but the human examples you're comparing against are just plain bad. The first one (strategy) is a typo-ridden bullet list of features. If I submitted that to my CEO, I'd lose my job. The answers to the second one (north star metric) are both terrible. Again, the human one has typos.
Neither of them dives into the why, and neither is designed to drive engagement, set context, or spark conversation. They're both, objectively, very poor answers to the question.
All we seem to have validated here is that there are a lot of PMs (or potential PMs) who need a LOT of coaching to be effective and don't receive it, and that AI can do a passable job of impersonating someone at the very beginning of their product career.
I'd also point out that, if these are the BEST answers a full-time prompt engineer can goad out of ChatGPT, then ChatGPT is still completely incapable of doing the actual hard work of writing a strategy: driving buy-in, agreement, etc.
The difficulty in doing an experiment like this is that:
a) everybody has a different idea of what a good answer is (hundreds of people in Lenny's audience voted in this poll, and many preferred the AI)
b) the real answers to questions like this are not published publicly (Exponent is one of the rare places they are, and these were among the highest-rated answers)
c) it's hard to know how much effort to put into beating the human (I spent a few hours on each prompt to make the article interesting, but there's always more to optimize)
I feel like this exercise was enough to prove the point and start a discussion. It's at least interesting that hundreds of people voted for AI over highly rated human answers to a task. What that says about the state of product management vs the rise of AI is an exercise for the reader. :-)
FYI we're planning a more comprehensive follow-up. If you'd be open to it, I'd love to do a 'stretch' experiment where you submit your answer to a task, I try to beat it with AI, and we run another experiment. We could include it in the PM hard evaluation benchmark we're constructing, as a next step to see if these results hold. We can't do it properly without more real human answers from top experts, though!
This was one of the best reads I have had about such topics. Thanks for sharing more on the prompt engineering and the examples. Very cool
Thank you for putting this together! Just bought your book and am excited to read more.
As you've shown, LLMs can produce high-quality responses, but only with very detailed prompts for each question. The challenge here is that the LLM overfits to that specific question and would perform poorly on the wide spectrum of possible questions.
Similar to classical ML, the ideal approach would be to prompt an LLM with many examples of questions and see how that single LLM performs over diverse questions. Have you discovered prompting techniques that can teach the LLM to act like a generally smart PM and provide good responses across diverse questions?
Yes, really good point. I think I can use the standardised patterns from these prompts to automatically generate new prompts for any PM task, roughly along the lines of the sketch below, though the process is still quite brittle.
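To make that concrete, here's a minimal sketch of what generating prompts from a standardised pattern could look like. The template fields, structure steps, and example task are hypothetical placeholders made up for illustration; they aren't the actual prompts from the article.

```python
# Hypothetical sketch of the "standardised pattern" idea: one reusable
# template filled in per PM task. Field names, structure steps, and the
# example task are placeholders, not the article's actual prompts.

PROMPT_TEMPLATE = """You are a senior product manager at {company}.

Task: {task}

Structure your answer as follows:
{structure}

Here is an example of a strong answer to a similar task:
{example}

Now write your answer for: {question}
"""

def build_prompt(company: str, task: str, structure: list[str],
                 example: str, question: str) -> str:
    """Fill the standardised pattern with task-specific details."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(structure, 1))
    return PROMPT_TEMPLATE.format(company=company, task=task,
                                  structure=numbered, example=example,
                                  question=question)

# Example usage with a made-up north-star-metric task:
prompt = build_prompt(
    company="a B2C marketplace",
    task="Choose a north star metric",
    structure=["State the product's core value", "Propose the metric",
               "List supporting input metrics", "Call out risks and counter-metrics"],
    example="<paste a highly rated human answer here>",
    question="What should the north star metric be for this product?",
)
print(prompt)
```

Swapping the example and structure per task is the brittle part: a weak example in the template tends to propagate into every generated prompt.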
I see we've spent time nit-picking the experiment, which is totally valid, but I'm more interested in its conclusions. Let's say AI improves and can outperform product managers in 24 months: what does that look like? According to Carta's most recent reports, PM hiring has decreased significantly in the last year. Are we already seeing the impact of AI on product? Will our jobs become obsolete?
I suspect the slowdown is related more to the end of the ZIRP era than to AI. It's still too early, and not enough companies have adopted it fully. Rather than the end of the job, I suspect good PMs will just be expected to have higher output. The companies I contract with haven't fired anyone, but they have been slow to hire, so I think it'll be a sort of stasis at worst.
The fundamental flaw I found with this experiment is that it requires a corpus of known good data (strategy docs, specs, etc.) to start with. Setting aside whether the content from Exponent is good or not (it's actually quite bad), all we've demonstrated is that AI is great at regurgitating similar content when properly prompted. You'll also notice that all the examples were for well-known B2C products, where everyone has an opinion on what the next move should be. No such corpus of data exists for the startup nobody's heard of.
True, if we get to a world in which AI is listening in on and transcribing every single business conversation inside a company (an Orwellian future, but who knows), then AI might start to approach the output quality of a good PM. But will it actually be able to exercise good judgment or propose original ideas, or will it simply parrot what's popular from that corpus? If it can truly do those things, then we've created sentience, and I'm not sure we should be worried solely about AI replacing PMs at that point.
Yes, I find that point very encouraging: we can offload all of our regurgitation tasks to the LLMs and spend our time developing eclectic tastes and doing more interesting and innovative work. We can put those unique blends of preferences into the prompts to give the AI better context and data for more interesting strategic decisions. That would be a huge win for the PM profession! I guess what this shows is just how much of the PM role, as it currently stands, can be done by regurgitating repeating patterns.
... so, won't be replacing at all?
Interesting experiment but I'm left with a lot of questions.
First, if the prompt demands that the response be similar to the provided example, what attributes are we expecting to differ between the human and AI outputs? What's the goal here? To compare comprehensiveness between two outputs that use very similar approaches?
Second, I have strong reservations about the inputs/examples used; Beau has articulated those very well in another comment.
Third, I disagree with the author that hundreds of people voting for an AI response is a meaningful takeaway in itself. It's a good way to get engagement on socials, but it doesn't tell us much more than that.
I like the experiment's idea but it warrants more rigorous analysis.
How would you have designed the experiment differently? The goal is to follow up with more rigorous analysis given the interesting results of this test.
a) with prompt engineering, the goal is usually to match human expectations on a task done by an LLM, so very often you have examples of the existing task done by a human, and you rate the responses based on human preferences, as we have done here. Everyone has subjective differences of opinion about what a 'good' answer is on these fuzzy tasks, so the best you can do is try to maximise for their subjective preferences.
b) I addressed Beau's comment, but essentially this relates to subjectivity. Exponent has some of the best publicly available examples of real PM work, and these are among the best-rated examples. Everyone keeps telling me these examples aren't good and that they have better examples internally, yet if there aren't good examples we can point to publicly, it's hard to advance the state of product management.
c) human preference voting is precisely how models are evaluated by researchers (there's a rough sketch of aggregating such votes below). Our experiment wasn't designed to be held up to rigorous academic standards, as I'm an operator not a researcher, but I'm not sure what we could use other than human preferences to settle the debate on which answers are better.
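For what it's worth, here's a rough sketch of how those pairwise votes (AI answer vs. human answer) could be turned into a win rate with a simple confidence interval. The vote counts are made up for illustration and aren't the real poll numbers.

```python
# Rough sketch of aggregating pairwise preference votes (AI answer vs. human
# answer) into a win rate with a normal-approximation confidence interval.
# The vote counts below are made up purely for illustration.
import math

def win_rate_with_ci(ai_votes: int, human_votes: int, z: float = 1.96):
    """Return the AI win rate and an approximate 95% confidence interval."""
    n = ai_votes + human_votes
    p = ai_votes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

p, (low, high) = win_rate_with_ci(ai_votes=620, human_votes=380)
print(f"AI preferred {p:.0%} of the time (95% CI: {low:.0%} to {high:.0%})")
```

With enough votes the interval narrows, which is roughly the logic behind using preference polls as an evaluation signal in the first place.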
Any thoughts?
It's a great article, and I really appreciated the methodology you used and the rigor of the analysis.
I think a lot of current PM tasks, i.e. the stuff we all do to achieve the needed outcome, will get augmented or replaced by LLMs (and other tools).
The part that'll take longer is the true PM Agent - the initiative someone needs to take and the grit and judgement you need to show. It's not impossible either. Over time I can see a lot of functions collapsing.
I tried to write something up about this a little while ago:
https://salgar.substack.com/publish/post/140727611
Thanks for the article!
Agreed