AI Prompting

I’ve spent the last week working through some of the latest language models, figuring out how best to use each one and the nuances of how each needs to be prompted.

I’ve been using:

  • Google Bard
  • OpenAI GPT-4 & GPT-3.5
  • Anthropic Claude 2
  • Llama 2
  • Jasper API

I’ll go into a little detail about each one and what’s exciting about it:

My overall task:

I’m working on creating overview content for specific pages that takes into account a unique location for a product or service. This means combining location information with product/service content to create something genuinely useful for the end user.

Google Bard: 

Google shipped a major update to Bard on July 13th, 2023. There’s a lot more available now, so I thought I’d give it another try.

Prompt Strategy:

After a lot of testing, the prompt worked best when I used long-form paragraphs, numbered tasks, and refinement across multiple prompt iterations.
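
To make that concrete, here’s a rough sketch of that structure in Python (the placeholder data and task list are invented for illustration; I just build the string and paste it into Bard’s web UI):

    # Sketch of the long-form, numbered-task prompt structure. All the
    # placeholder data below is hypothetical, not my actual content.
    location_facts = "Neighborhood details, landmarks, local terminology..."
    service_copy = "What the service includes, who it serves, guarantees..."

    prompt = f"""You are writing an overview for a local service page.
    Background on the location: {location_facts}
    Background on the service: {service_copy}

    Complete the following tasks in order:
    1. Write a three-paragraph overview tying the service to the location.
    2. Keep the tone informative rather than promotional.
    3. Use only facts from the background above.
    4. End with a one-sentence summary."""

    print(prompt)  # paste into Bard, then refine over several follow-up turns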

Response Quality:

I was not able to get a production-ready response that I would feel comfortable putting on a website without serious editing. Bard could not reliably follow instructions; I had to keep reminding it of about four requirements that it kept missing. “You’re right, I apologize for the mistake.” came up constantly when I asked it to verify that an instruction had been carried out.

Bard also does not do a good job of advising on its own prompting, so it’s a lot of experimentation.

OpenAI GPT-4:

Prompt Strategy:

My original prompt was written for GPT-3.5, and I’ve since adapted it for GPT-4. GPT-4 (Chat), especially since mid-June, works very well when I give it a conversational request, almost as if a conversation with a content writer had been transcribed. I can get away with zero-shot prompting and it follows all the directions I give it.

Response Quality:

I can get a production-ready piece of content in a single response, and I can run a list of verification commands to ensure it has followed my request exactly.
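
For reference, the flow looks roughly like this with the OpenAI Python library (as of mid-2023; the request text and the {{city}} placeholder are stand-ins for my real prompt):

    import openai

    openai.api_key = "sk-..."  # your API key

    # A conversational, zero-shot request. The content here is a stand-in.
    messages = [{"role": "user", "content": (
        "I need an overview for a service page tied to a specific location. "
        "Background: ... Write three paragraphs, wrap each in <p> tags, and "
        "keep the {{city}} placeholder exactly as written."
    )}]

    first = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    draft = first.choices[0].message.content
    messages.append({"role": "assistant", "content": draft})

    # The verification pass: have it confirm each instruction was followed.
    messages.append({"role": "user", "content": (
        "List each requirement above and confirm it was met. If any were "
        "missed, rewrite the content to comply."
    )})
    check = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    print(check.choices[0].message.content)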

OpenAI GPT-3.5:

Prompt Strategy:

I reuse my revised GPT-4 prompt, and once it finishes, I have a follow-up prompt that enforces specific requirements the response typically misses. This includes specific formatting, and leaving intact the placeholder variables my CMS will use to pull in real-time data.
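
A small guard run between the two prompts makes that follow-up targeted (the {{city}} and {{phone}} variables here are hypothetical examples, not my actual CMS tokens):

    import re

    # Hypothetical CMS placeholders that must survive the model's output.
    REQUIRED_VARS = ["{{city}}", "{{phone}}"]

    def needs_cleanup(text: str) -> bool:
        """True if a CMS variable was dropped or the <p> formatting is missing."""
        vars_intact = all(var in text for var in REQUIRED_VARS)
        has_paragraphs = re.search(r"<p>.*?</p>", text, re.S) is not None
        return not (vars_intact and has_paragraphs)

    # if needs_cleanup(draft): run the secondary formatting prompt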

Response Quality:

My response is production quality after running my secondary prompt to clean up and reinforce specific rules and formatting. The quality of the content is very close to GPT-4, since I provide a large amount of background data and content for it to draw on.

Anthropic Claude 2:

I don’t have access to the API, so I’m using the text interface. This is one of my favorite conversational large language models, and Anthropic has good documentation on ways to present data in the prompt to provide context and give the model material to work with.

Prompt Strategy:

Claude 2 had me completely revise my existing prompt. Even though it can handle 100k tokens of context, I found that it required much stricter structuring of the prompt than OpenAI’s models.

Following their documentation, I used XML tags for the different data types, wrapping contextual and additional data in specific XML tags that I could then reference in other parts of the prompt.

I also found that including an example (one-shot) helped solidify the format without Claude over-relying on the example’s phrasing. GPT-4, by contrast, will often use such similar sentence structure that the two pieces of content feel too alike.
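
Roughly what that looks like (the tag names and all data below are illustrative; I build the string in Python purely so it’s easy to paste into the claude.ai interface):

    # Sketch of the XML-tagged structure from Anthropic's prompt docs.
    # Every value here is a made-up placeholder.
    location_data = "Neighborhood details, landmarks, local terminology..."
    service_data = "What the service includes, pricing notes, guarantees..."
    example_output = "<p>One example paragraph in the target format...</p>"

    prompt = f"""Use only the information inside the tags below.

    <location>{location_data}</location>
    <service>{service_data}</service>
    <example>{example_output}</example>

    Write a location-aware overview of the offering in <service>, grounded
    in the facts in <location>. Match the formatting of <example> without
    reusing its wording."""

    print(prompt)  # paste into the claude.ai text interface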

Claude 2 has the best prompt diagnosis and suggestions built into the LLM itself. I was able to fine-tune the prompt using the model, which got me down to a single prompt/response.

Response Quality:

I was able to get a production-ready response with a single prompt. The quality of the output was as good as or better than GPT-4’s; however, since I don’t have API access, it’s not as easy to test across more content types.

Llama 2:

It’s free. It’s legal to use commercially. Did I mention it’s free? Well, it’s free if you have the hardware to run it. My system, though very capable, doesn’t quite meet that requirement. However, I was able to spin up an AWS endpoint with four Nvidia Tesla T4 GPUs capable of running the 13B model.
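
My endpoint just wraps the standard Hugging Face loading path; here’s a minimal sketch, assuming you’ve been granted access to the gated meta-llama weights and have accelerate installed:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-13b-chat-hf"  # gated; requires Meta approval

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # float16 plus device_map="auto" shards the 13B weights across the GPUs,
    # which is what makes the 4x T4 (16 GB each) setup workable.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    inputs = tokenizer("Prompt text goes here...", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))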

I’m most excited about this model as it would allow for completely local text generation with complete control of the model, hardware, and privacy.

Prompt Strategy:

I never worked with Llama v1, since it was research-only unless you pulled leaked or derivative versions. I went in expecting it to work similarly to GPT-3.5, and it does. I’m limited in tokens due to hardware constraints, but I found that my GPT-4 prompt worked very well with Llama 2.
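
The one adaptation is wrapping the existing prompt in Llama 2’s chat template, since the chat weights were trained on the [INST]/<<SYS>> format (the system and user text here are placeholders):

    def llama2_chat_prompt(system: str, user: str) -> str:
        # The [INST]/<<SYS>> wrapper the Llama 2 chat models were trained on.
        return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

    prompt = llama2_chat_prompt(
        "You are a content writer producing location-aware overviews.",
        "My existing GPT-4-style request goes here...",
    )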

Response Quality:

I would put the quality closer to GPT-3.5, but with the privacy benefits, I feel there’s a lot of opportunity to refine the response through prompt chaining. It had some issues following instructions in that it tended to exaggerate what I wanted done. For example, I asked it to put <p> tags around each paragraph, but it also decided to create headings with heading tags. Overall, I liked the way it worded the content, though it felt a bit too salesy and positive compared to the same tones I requested from the other models. I tested the “chat” version of the LLM and had less success with the non-chat versions; still, I think this model, with some minor refinement, could be as good as GPT-4, and at a much lower overall cost in the long run.
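
Since the failure mode was extra tags rather than missing ones, a quick standard-library check like this (hypothetical, not part of my production pipeline) flags when a refinement prompt is needed:

    import re

    ALLOWED_TAGS = {"p"}  # the instructions only asked for paragraph tags

    def extra_tags(html: str) -> set:
        """Return any tag names in the output beyond what was requested."""
        found = set(t.lower() for t in re.findall(r"</?([A-Za-z][A-Za-z0-9]*)", html))
        return found - ALLOWED_TAGS

    sample = "<h2>A heading it invented</h2><p>The paragraph I asked for.</p>"
    print(extra_tags(sample))  # {'h2'} -> chain a refinement prompt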

Jasper API:

Having access to this has been really great as I’ve been building out AI tools that need multiple models. Their command endpoint allows for about 6k characters, which lets me really push the size of the prompt.

Prompt Strategy:

My prompt here is focused nearly identically to the GPT-4 one, and I suspect Jasper is actually using GPT-4 or a variation of it behind the API. The prompt is made up of sections of content with context and instructions, along with formatting requests. I found that zero-shot worked as well as it does with GPT-4.
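
The call itself is a plain HTTP POST; in this sketch the endpoint path, header name, and payload fields are written from memory, so treat them as assumptions and check Jasper’s API docs before using them:

    import requests

    command = "Sections of context, instructions, and format requests..."
    assert len(command) <= 6000  # stay under the ~6k character limit

    # Endpoint path, header name, and payload shape below are assumptions;
    # verify against Jasper's current API documentation.
    resp = requests.post(
        "https://api.jasper.ai/v1/command",
        headers={"X-API-Key": "your-jasper-key"},
        json={"inputs": {"command": command}},
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())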

Response Quality:

I was able to get production-quality responses without needing to make any adjustments.

So far it looks like Jasper and GPT-4 are fairly easy to get quality results from. I was pleasantly surprised by Anthropic’s Claude 2, and I like the formatting they’ve trained their LLM on. I’m hoping to get access to their API so I can really put it to the test. Llama 2 wasn’t bad, but it couldn’t quite get me production-quality content, so I’ll have to look into fine-tuning the model to align closer to the responses I’m looking for.

I’m curious how many of you have been creating a prompt library aligned with different LLMs, and whether you’ve found a prompt style that works across all of them.

This was 100% written by hand; no AI wrote any portion of this content. Does that make this a better article? I’d love to hear your thoughts on that too!