Prompt engineering and evaluating AI-generated copy
LLM engineering and evaluation are still relatively new fields. During my time at Thumbtack, I learned new skills in this next generation of content editing.
One of my projects in 2025 involved using LLMs to pre-fill messages that homeowners could send to a home care professional. These messages went out whenever a customer submitted a job request for work to be done on their house.
LLM-generated customer messages: mini case study
One of our pro product managers came to me with a problem. Customers in our baseline product filled out a job request for some kind of home care - be it landscaping, roof repair, plumbing, you name it. They would pick one or more pros to send their job request to, and ask for a quote and availability.
They could also optionally send a message to the pro, with any relevant information not covered in the details of the job request. This could include any special needs, a few initial suggestions for dates and times, or something else the customer felt their pro should know about the house.
Project goals
Over time, pros had come to rely on these customer messages to better understand the customer’s project. Pros often read the message before the project details, because the message was more conversational and created an open door through which to reply to the customer. Pros were more likely to reply quickly, often same-day, when the customer filled the message out. Otherwise, they tended to shelve the project to look at later, when they had time to comb through the details.
However, these customer messages were optional, and only 32% of customers created one. The rest opted to skip this step and leave the message field blank. As a business, we felt that making the message mandatory would cause too much friction for the average customer.
And among the messages that were sent, many contained very little helpful information for the pro. Customers often used the space to write things like:
“Could you let me know your rate per hour? I’m shopping around. Thank you.”
…or,
“I was wondering what the process would look like, and if it’d be possible for this to be completed soon. Let me know.”
This was better than nothing, because it still opened that door of communication for the pro to walk through. But it meant more back-and-forth between the pro and the customer, often with the pro asking for details that could’ve been supplied in the initial message. During that back-and-forth, the customer or pro might get busy or distracted with other priorities, so fewer of these leads converted to finished jobs on Thumbtack.
The PM asked me if there was some way to increase the number of customer messages, improve the quality of these messages, and therefore, get pros replying faster and potentially converting more leads into jobs.
Content goals
I came up with the idea of using an LLM to pre-fill the message box with a sample message. This sample message would include some combination of regular conversation, a summary of the project details filled out by the customer, and a request for a reply from the pro. The customer could then choose to A) send the message as-is, B) edit the message to their heart’s content, and then send, or C) wipe it clean and write their own or leave it blank.
It was important to me that the prefilled message:
Sound plausibly human, like something a customer might actually type;
Have some variety and differentiation, so that pros weren’t getting identical-sounding messages from different customers;
Contain the most useful information from the customer’s project details, for use by the pro; and
Do it all in the briefest space possible - preferably under 300 characters, to match the average length of the messages customers were already sending.
A fifth goal was in flux from the beginning: I believed the messages should be flagged as AI-assisted. I felt we shouldn’t be trying to fool our pros into thinking these messages were always 100% authentically coming from the customer.
Labeling felt like the honest thing to do, but I also worried that AI-generated messages could never be perfect, and that pros might start to ‘sniff out’ the AI influence. If we didn’t label it as AI upfront, they might find the whole process disingenuous and feel that the overall lead quality was low.
The product manager argued that the flag wasn’t necessary: since the customer could edit the message at any time, every message was, at least in theory, approved by the customer, if not contributed to.
Choosing the right LLM
I started feeding rough-draft prompts, built from sample project details and other information, to a series of LLMs our company had access to. These included models developed by OpenAI, Anthropic, Meta, and Amazon Web Services (AWS).
The goal was to suss out the LLM that would produce the most readable, conversational language from the project variables provided. In the first round of prompting, I saw it as a tie between two similar models hosted on AWS Bedrock: Llama, an open-source model from Meta, and Mistral, created by the French company of the same name. Both arranged the draft content in a way that felt human and logical, even before additional prompt refining.
Early samples of customer messages from ChatGPT, Anthropic’s Claude, Mistral AI, Qwen Chat, and Meta’s Llama.
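To make the comparison concrete: a small harness like the sketch below, assuming access to Bedrock’s Converse API through boto3, can collect side-by-side drafts from several candidate models. The model IDs are public Bedrock identifiers used for illustration, not necessarily the exact versions evaluated.

```python
# Minimal sketch of a side-by-side draft comparison on AWS Bedrock.
# Model IDs are illustrative public identifiers, not the exact versions tested.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

CANDIDATE_MODELS = [
    "meta.llama3-1-70b-instruct-v1:0",   # Meta's Llama
    "mistral.mistral-large-2402-v1:0",   # Mistral
]

# A rough early draft prompt, stripped down for illustration.
draft_prompt = (
    "Write a short, friendly first message from a homeowner to a house "
    "painter, summarizing the project and asking for a quote and availability."
)

for model_id in CANDIDATE_MODELS:
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": draft_prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.7},
    )
    draft = response["output"]["message"]["content"][0]["text"]
    print(f"--- {model_id} ---\n{draft}\n")
```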
Prompt refining
Once I’d picked my top two LLMs, I started to enhance and expand the prompts I was feeding them. My goal was to make the messages they spit back more useful, more accurate, and more human. This came down to five essential sections of the prompt:
1) Persona
The LLM needed to understand the kind of customer it was imitating. It was obviously important to prompt the LLM to write as a homeowner looking for a professional.
But I also wanted the LLM to understand the context: this customer was on a two-sided marketplace pairing pros with customers, and had just filled out a short series of details to initiate a new home project.
Sample of a prompt sent to Llama 3.1.
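The production prompt lived in an internal template, but a persona section in the spirit described above might read something like this (the wording here is an illustrative reconstruction, not the shipped prompt):

```python
# Illustrative reconstruction of the persona section; the wording is assumed,
# not the production prompt.
PERSONA = """You are a homeowner on a two-sided marketplace that pairs \
customers with home care professionals. You have just filled out a short \
intake form describing a new project, and you are writing a brief first \
message to a pro you are considering hiring. Write in the first person, \
as the customer."""
```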
2) Grammar and Tone
I wanted these messages to err on the side of friendly and conversational, but not at the expense of a semi-strict character limit. Although in the ‘real world,’ customers could sometimes be curt or rude to pros and vice versa, this was an opening message designed to spark conversation between both parties. So I fed the LLM examples that were on the more pleasant side. In particular, I told it to avoid aggressive or demanding openers, like “You need to help me with my bathroom.”
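As a sketch, that guidance could be captured in a tone section along these lines (again, illustrative wording):

```python
# Illustrative tone section; the wording is assumed.
TONE = """Be friendly and conversational, but brief. Avoid aggressive or \
demanding openers like "You need to help me with my bathroom." Open the \
door to a reply rather than issuing instructions."""
```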
3) Input Data
I worked with our engineering team to create a data set that mapped to the project intake form filled out by the customer. It included the full search query from the customer (a hand-typed answer from the customer to the question “what are you looking for?”), a list of answers chosen by the customer to other short questions about the project, and some scheduling information.
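The real schema was internal to Thumbtack; a hypothetical payload in roughly that shape (field names invented for illustration) could look like:

```python
# Hypothetical intake-form payload. The real field names and schema were
# internal, so everything here is invented for illustration.
project_input = {
    # Hand-typed answer to "what are you looking for?"
    "search_query": "looking for someone to repaint my living room",
    # Answers chosen from short multiple-choice project questions.
    "structured_answers": [
        {"question": "What type of property?", "answer": "House"},
        {"question": "How many rooms need painting?", "answer": "2 rooms"},
        {"question": "Are the walls prepped?", "answer": "Needs some patching"},
    ],
    # Scheduling information captured during intake.
    "scheduling": {
        "preferred_timing": "Within the next 2 weeks",
        "flexible": True,
    },
}
```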
4) Exceptions
Certain data didn’t translate well into the LLM’s responses. In particular, some project questions had multiple-choice answers that displayed as a range - for example, “I am 50 to 65 years old,” or “the house is 6-8 rooms.” These answers felt awkward and inhuman when summarized by the AI: for example, “I’m between 50-65 years old and I’m looking for a personal trainer,” or “I need some painting done in my 6 to 8 room house.” (Everyone knows their own exact age, and how many rooms are in their own home.) So I carved these ranges out as exceptions, instructing the model to rephrase or drop them rather than repeat them verbatim - as in the sketch below.
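One way to encode that rule is an exceptions section like the following (illustrative wording):

```python
# Illustrative exceptions section; the wording is assumed.
EXCEPTIONS = """Some answers are multiple-choice ranges, such as "6-8 rooms" \
or "50 to 65 years old." Never repeat a range verbatim - a real person knows \
their exact age and how many rooms their house has. Rephrase the range \
loosely ("a few rooms") or leave it out if it cannot be rephrased naturally."""
```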
5) Task
With all these pieces in place, I refined the AI’s task as:
summarize your request in one to three sentences;
do your best to create a natural-sounding summary;
reword or rephrase some of your original query or answers to questions to make them sound more natural; and
optimize for coherence most of all.
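Putting it together: below is a minimal sketch of how the five sections above might be assembled into a single prompt, with the intake data serialized as JSON. The build_prompt function and the TASK wording are illustrative, not the production template.

```python
import json

# Illustrative task section, paraphrasing the four instructions above.
TASK = """Summarize your request in one to three sentences. Make the summary \
sound natural: reword or rephrase the original query and answers rather than \
repeating them. Optimize for coherence above all, and stay under 300 characters."""

def build_prompt(persona: str, tone: str, project_input: dict,
                 exceptions: str, task: str) -> str:
    """Join the five prompt sections in order: persona, grammar and tone,
    input data, exceptions, task."""
    return "\n\n".join([
        persona,
        tone,
        "Project details from the customer's intake form:\n"
        + json.dumps(project_input, indent=2),
        exceptions,
        task,
    ])
```

Calling build_prompt(PERSONA, TONE, project_input, EXCEPTIONS, TASK) with the sections sketched earlier would yield the string sent as the user message in a Bedrock call like the one shown above.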