AI-Driven Insights: Leveraging LangChain and Pinecone with GPT-4

Empowering Next-Gen Product Managers

Elen Gabrielyan
Towards Data Science


Working effectively with qualitative data is one of the most important skills a product manager can have: collecting data, analyzing it, and communicating it efficiently by turning it into actionable, valuable insights.

You can get qualitative data from many places — user interviews, competitor feedback, or comments from people using your product. Depending on what you’re trying to achieve, you might analyze this data straight away or save it up for later. Sometimes, you might only need a few user interviews to confirm a hypothesis. Other times, you could need feedback from a thousand users to spot trends or test ideas. So, your approach to analyzing this data can change depending on the situation.

With Large Language Models like GPT-4, and AI tools such as LangChain and Pinecone, we can handle various situations and lots of data more effectively. In this guide, I’ll share my experience with these tools. My goal is to show product managers and anyone else who works with qualitative data how to use these AI tools to get more useful insights from their data.

What will you find in this guide-style article?

  1. I’ll start by introducing you to these AI tools, and some current limitations of Large Language Models (LLMs).
  2. I’ll discuss different ways you can make the most of these tools for real-life use cases.
  3. Using user feedback analysis as an example, I’ll provide code snippets and examples to show you how these tools work in practice.

Please note: To use tools like GPT-4, LangChain, and Pinecone, you need to be comfortable with data and have some basic coding skills. It’s also important to understand your customers and be able to turn data insights into real actions. Knowledge of AI and machine learning is a plus, but not a must.

Understanding AI Tools: why you might need LangChain and Pinecone

Assuming you’re already familiar with GPT-4, it’s important to grasp some concepts as we discuss tools that work with LLMs. One major challenge with current LLMs like GPT-4 is their ‘context window’ — this is how much information they can process and remember at one time.

Currently, there are two versions of GPT-4. The standard one has an 8k-token context window, while the extended version has a 32k window. To give you an idea, 32k tokens is about 24,000 words, roughly 48 pages of text. But bear in mind, the 32k version isn’t available to everyone, even if you have access to GPT-4.

OpenAI also recently announced a new ChatGPT model, gpt-3.5-turbo-16k, which offers 4 times the context length of gpt-3.5-turbo. For insight analysis I would suggest working with GPT-4, as it reasons better than GPT-3.5, but you can experiment and see what works for your use case.
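If you’re unsure whether your text fits into a model’s context window, you can count the tokens yourself. Below is a minimal sketch using OpenAI’s tiktoken library; the sample text is just a placeholder:

import tiktoken

# Pick the tokenizer that matches the target model
encoding = tiktoken.encoding_for_model("gpt-4")

sample_text = "Your interview transcript or feedback goes here."
num_tokens = len(encoding.encode(sample_text))
print(f"This text is {num_tokens} tokens long")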

Why am I mentioning this?

When dealing with insight analysis, a big challenge comes up if you have a lot of data, or if you’re interested in more than one prompt. Say you have a single user interview and want to dig deeper into it using GPT-4. In that case, you can just paste the transcript into ChatGPT, choosing GPT-4. You might have to split the text once, but that’s it; you don’t need any fancy tools. Where these new tools really earn their keep is when you’re working with a lot of qualitative data. Let’s understand what those tools are, then we’ll move to specific use cases and examples.

So what is LangChain?

LangChain is a framework that revolves around LLMs and offers various functionalities like chatbots, Generative Question-Answering (GQA), and summarization. Its versatility lies in the ability to connect different components together, including prompt templates, LLMs, agents, and memory systems.

Prompt templates are pre-made prompts for different situations, while LLMs process and generate responses. Agents help make decisions based on the LLM’s output, and memory systems store information for later use.

I’ll showcase some of these capabilities in the examples later in this article.
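To make the prompt template idea concrete, here is a minimal sketch using LangChain’s PromptTemplate; the template text and variable name are illustrative, not from the project:

from langchain.prompts import PromptTemplate

# A reusable prompt with a placeholder for the feedback text
feedback_prompt = PromptTemplate(
    input_variables=["feedback"],
    template="Summarize the main pain point in this user feedback:\n\n{feedback}",
)

print(feedback_prompt.format(feedback="The meeting notes keep missing action items."))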

High-level Overview of LangChain Modules

What is Pinecone?

Pinecone.ai is a powerful tool designed to simplify the management of high-dimensional data representations known as vectors.

Vectors are particularly useful when dealing with a lot of text data, like when you’re trying to extract information from it. Consider a situation where you’re analyzing feedback and you want to find out various details about a product. This kind of deep insight gathering wouldn’t be possible with just keyword searches like “great”, “improve”, or “i suggest”, as you might miss out on a lot of context.

Now, I won’t delve into the technical aspects of text vectorization (which could be word-based, sentence-based, etc.). The key thing you need to understand is that words get converted into numbers through machine learning models, and these numbers are then stored in arrays.

Let’s take an example:

The word “seafood” might be translated into a series of numbers like this: [1.2, -0.2, 7.0, 19.9, 3.1, …, 10.2].

When I search for another word, that word also gets transformed into a number series (or vector). If our machine learning model is doing its job properly, words that have a similar context to “seafood” should have a number series that’s close to the series for “seafood”. Here’s an example:

“shrimp” might be translated as: [1.1, -0.3, 7.1, 19.8, 3.0, …, 10.5], where the numbers are close to those of “seafood”.
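To show what “close” means here, this small sketch computes the cosine similarity between two vectors with NumPy; the numbers are the made-up ones from the example above, truncated for brevity:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

seafood = np.array([1.2, -0.2, 7.0, 19.9, 3.1, 10.2])
shrimp = np.array([1.1, -0.3, 7.1, 19.8, 3.0, 10.5])

print(cosine_similarity(seafood, shrimp))  # should be close to 1.0

Vector databases like Pinecone run exactly this kind of comparison at scale, across millions of stored vectors.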

With Pinecone.ai, you can efficiently store and search these vectors, enabling quick and accurate similarity comparisons.

By using its capabilities, you can organize and index vectors derived from LLM models, opening the door to deeper insights and the discovery of meaningful patterns within extensive datasets.

In simpler words, Pinecone.ai allows you to store the vector representations of your qualitative data in a convenient way. You can easily search through these vectors and apply LLM models to extract valuable insights from them. It simplifies the process of managing your data and deriving meaningful information from it.

Representation of Vector Databases

When would you actually need tools like LangChain and Pinecone?

Short answer: when you are working with a lot of qualitative data.

Let me share some use cases from my experience to give you an idea:

  • Imagine you have thousands of written feedback entries from your product channels. You want to identify patterns in the data and track how the feedback has evolved over time.
  • Suppose you have reviews in different languages and you want to translate them into your preferred language, and then extract insights.
  • You aim to conduct competitive analysis by analyzing customer reviews, feedback, and sentiment regarding your competitors’ products.
  • Your company conducts surveys or user studies, generating a significant volume of qualitative responses. Extracting meaningful insights, uncovering trends, and informing product or service improvements are your goals.

These are just a few examples of situations where tools like LangChain and Pinecone can be invaluable for product managers working with qualitative data.

Example Project: Feedback analysis

As a product manager, my job involves improving our meeting notes and transcription features. To do this, we listen to what our users say about them.

For our meeting notes feature, users give us a score between 1 and 5 for quality, tell us which template they used, and also send us their comments. Here is the flow:

In this project, I looked closely at two things: what users said about our feature and which templates they used. I ended up dealing with a huge amount of data: over 20,000 words, which turned into more than 38,000 tokens when I broke it down with a tokenizer. That’s more than even some advanced models can handle at once!

To help me analyze this extensive data, I turned to two advanced tools: LangChain and Pinecone, supplemented with GPT-4. With these in our arsenal, let’s delve deeper into the project and see what these high-tech tools enabled us to do.

This project’s primary objective was extracting insights from the gathered data, which required:

  1. The ability to create specific queries related to our dataset.
  2. The use of LLMs for handling vast information volumes.

First, I’ll give you an overview of how I carried out the project. After that, I’ll share some examples of the code I used.

We start with a collection of text files. Each file contains user feedback paired with the name of the template they used. You can process this data to fit your needs — I had to do some post-processing for my project. Remember, your files and data might be different, so feel free to tweak the information for your own project.

For example, suppose you want to understand users’ feedback on meeting notes structure:

query =  "Please list all feedback regarding sentence structures in a table \
in markdown and get a single insight for each one, and give a general summary for all."

Here’s a high-level diagram showing the process flow when using an LLM with Pinecone. You ask GPT-4 a question, or “query”. Meanwhile Pinecone, which stores all the feedback, provides context for that query: the question itself is embedded and sent to Pinecone, which returns the most relevant feedback as context. Together, they help us make sense of our data efficiently:

Below is a more simplified version of the diagram:

Let’s do it! In this script, we set up a pipeline to analyze user feedback data using OpenAI’s GPT-4, Pinecone, and LangChain. Essentially, it imports necessary libraries, sets the path to the feedback data, and establishes the OpenAI API key for processing this data.

import os
import openai
import pinecone
import certifi
import nltk
from tqdm.autonotebook import tqdm
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

directory = 'path to your directory with text files, containing feedback'
OPENAI_API_KEY = "your key"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY  # LangChain reads the key from this environment variable

Then we define and call a function load_docs() that loads user feedback documents from a specified directory using LangChain’s DirectoryLoader. It then counts and displays the total number of loaded documents.

def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents

documents = load_docs(directory)
len(documents)

Next, we define and execute the split_docs() function, which divides the loaded documents into smaller chunks of a specified size and overlap using LangChain’s RecursiveCharacterTextSplitter. It then counts and prints the total number of resulting chunks.

def split_docs(documents, chunk_size=500, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(len(docs))

To work with Pinecone, which is essentially a vector database, we need to get embeddings out of our docs, so let’s introduce a function for that. There are many ways to do it, but let’s go with OpenAI’s embedding function:

# Assuming the OpenAIEmbeddings class is imported above
embeddings = OpenAIEmbeddings()

# Define a function to generate an embedding for a given query
def generate_embedding(query):
    query_result = embeddings.embed_query(query)
    print(f"Embedding length for the query is: {len(query_result)}")
    return query_result

To store those vectors in Pinecone, you need to create an account there and create an index as well, which is quite straightforward. You will then get an API key, an environment name, and the index name.

MY_API_KEY_p = "the_key"
MY_ENV_p = "the_environment"

pinecone.init(
    api_key=MY_API_KEY_p,
    environment=MY_ENV_p
)

index_name = "your_index_name"

index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
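If the index doesn’t exist yet, you can also create it from code. Here’s a sketch assuming the v2 pinecone client and OpenAI’s text-embedding-ada-002 embeddings, which have 1,536 dimensions:

# Create the index programmatically if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,   # dimension of OpenAI's ada-002 embedding vectors
        metric="cosine",
    )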

The next step is to be able to find answers. It’s like finding the closest points to your question in a field of possible answers, giving us the most relevant results.

def get_similar_docs(query, k=40, score=False):
    if score:
        similar_docs = index.similarity_search_with_score(query, k=k)
    else:
        similar_docs = index.similarity_search(query, k=k)
    return similar_docs

In this code, we set up a question-answering system using OpenAI’s GPT-4 model and LangChain. The get_answer() function takes a question as input, finds similar documents, and uses the question-answering chain to generate an answer.

from langchain.chat_models import ChatOpenAI

model_name = "gpt-4"
# GPT-4 is a chat model, so we use ChatOpenAI rather than the completion-style OpenAI class
llm = ChatOpenAI(model_name=model_name, temperature=0)

chain = load_qa_chain(llm, chain_type="stuff")

def get_answer(query):
    similar_docs = get_similar_docs(query)
    answer = chain.run(input_documents=similar_docs, question=query)
    return answer

We got to the question! Or questions. You can ask as many questions as you wish.

query =  "Please list all feedback regarding sentence structures in a table \
in markdown and get a single insight for each one, and give a general summary for all."

answer = get_answer(query)
print(answer)

Implementing Retrieval Q&A Chain:

To implement the retrieval question-answering system, we use the RetrievalQA class from LangChain. It uses an OpenAI LLM to answer questions and relies on a “stuff” chain type, which simply stuffs all retrieved documents into the prompt. The retriever is connected to the previously created index, and the chain is stored in the qa_stuff variable. If you want to go deeper, the LangChain documentation covers retrieval techniques in more detail.

from langchain.chains import RetrievalQA
retriever = index.as_retriever()

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

response = qa_stuff.run(query)
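If you also want to see which feedback chunks the answer was based on, RetrievalQA can return its source documents. A sketch of that variant:

# Variant that also returns the retrieved chunks behind the answer
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

result = qa_with_sources({"query": query})
print(result["result"])                 # the generated answer
for doc in result["source_documents"]:
    print(doc.page_content[:100])       # preview of each supporting chunk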

Now that we have the response, let’s present the content stored in the response variable in a visually appealing format using Markdown. This makes the displayed text more organized and easier to read.

from IPython.display import display, Markdown

display(Markdown(response))

Example output

Go ahead and experiment with both the input files and the queries to get the most out of this approach and these tools.

Conclusion

In short, GPT-4, LangChain, and Pinecone make it easy to handle big chunks of qualitative data. They help us dig into this data and find valuable insights, guiding better decisions. This article gave a sneak peek into their use, but there’s a lot more they can do.

As these tools continue to advance and become more common, learning to use them now will give you a significant advantage in the future. So, keep exploring and learning about these tools because they are shaping the present and the future of data analysis.

Stay tuned for more ways to explore these handy tools in the future!

All images, unless otherwise noted, are by the author

References

LangChain Documentation

Pinecone Documentation

LangChain for LLM Application Development short course
