Prepare for Efficient, Automated, and Advanced Insights with Pandas-AI and witness generative AI capabilities.
Have you ever imagined being able to talk to your data the way you talk to a close friend? Probably not.
What if I told you that you can do it now?
Well, this is what Pandas AI is for. It is a Python library that adds Generative AI capabilities to your data frames. Gone are the days of spending hours staring at complex rows and columns without making any meaningful progress.
Worry not: Pandas AI is not here to replace Pandas. It is best considered an extension of it. Imagine having a data frame that can write its own reports, or one that can analyze complex data and present you with easily understandable summaries. The possibilities are awe-inspiring!
In this concise guide, we’ll take you step by step through using this library, regardless of your experience level. Whether you’re an experienced data analyst or just starting out, this guide gives you the tools you need to start working with Pandas AI confidently.
So sit back, relax, and let’s explore what it has to offer! Before we dive deep into Pandas AI, let’s brush up on Pandas basics and key features.
Pandas is a powerful open-source Python library that provides high-performance data manipulation and analysis tools. It introduces two fundamental data structures, DataFrame and Series, which enable efficient handling of structured data.
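To make the two data structures concrete, here is a minimal sketch in plain Pandas. The values are toy data chosen for illustration only.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
happiness = pd.Series(
    [7.23, 7.22, 7.16],
    index=["Canada", "Australia", "United Kingdom"],
    name="happiness_index",
)

# A DataFrame is a two-dimensional table; each column is itself a Series.
df = pd.DataFrame({
    "country": ["Canada", "Australia", "United Kingdom"],
    "happiness_index": [7.23, 7.22, 7.16],
})

print(happiness["Canada"])                    # look up a value by label
print(df.shape)                               # (rows, columns)
print(type(df["happiness_index"]).__name__)   # selecting a column gives a Series
```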
Let’s explore some of the key features of pandas.
Pandas AI is an extension of Pandas with the capabilities of generative AI, taking data analysis to another level. Now, let’s get started with it.
Pandas AI is a Python library that incorporates generative artificial intelligence capabilities into Pandas, the popular data manipulation and analysis library.
Pandas AI is an open-source project. It expands the power of Pandas by adding generative artificial intelligence features. Acting as a user-friendly interface on top of Pandas, it allows you to interact with your data effortlessly. By sending smart prompts to LLM APIs, it lets you work with your data in a conversational format. This means you can directly engage with your data, making data exploration more intuitive and interactive.
The best part? With Pandas AI, you don’t have to build custom in-house LLMs, saving both money and resources.
As already mentioned, Pandas AI extends the capabilities of Pandas. But how? Let’s explore its role in improving data analysis.
It brings the power of artificial intelligence and machine learning to the existing Python Pandas library, making it a next-gen tool for simplifying data analysis. It cuts down the time analysts spend on repetitive, complex tasks by automating them. Pandas AI enhances analysts’ productivity, since they can focus on high-level decision-making.
It reduces the time and effort analysts spend on the operations below, which fall within the data analysis pipeline.
Imagine applying AI to the above operations. Start thinking about where you can implement AI and automate your daily tasks.
When it comes to analyzing data, Exploratory Data Analysis (EDA) is a critical step. It helps analysts uncover insights, spot patterns, and catch any unusual data points. Now, imagine taking EDA to the next level with the help of Pandas AI. This incredible tool automates tasks like data profiling and visualization. It digs deep into the data, creating summary statistics and interactive visuals. This means analysts can quickly understand the nature and spread of different variables. With this automation, the data exploration process becomes faster, making it easier to discover hidden patterns and relationships efficiently.
Dealing with missing data is a frequent hurdle in data analysis, and filling in those gaps accurately can greatly affect the reliability of our findings. Here’s where Pandas AI steps in, harnessing the power of AI algorithms to cleverly impute missing values. By detecting patterns and relationships within the dataset, it fills in the gaps intelligently.
But that’s not all! It takes a step further by automating feature engineering. It identifies and creates new variables that capture complex connections, interactions, and non-linear patterns in the data. This automated feature engineering boosts the accuracy of predictive models and saves valuable time for analysts.
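For reference, here is what imputation and feature engineering look like when done by hand in plain Pandas. This is a baseline sketch that is useful for checking AI-generated results, not a description of how Pandas AI works internally. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one missing value in "income".
df = pd.DataFrame({
    "income": [40000.0, None, 60000.0, 50000.0],
    "household_size": [2, 4, 3, 1],
})

# Baseline imputation: fill missing incomes with the column median.
# median() skips missing values by default.
median_income = df["income"].median()
df["income"] = df["income"].fillna(median_income)

# Simple engineered feature: a ratio of two existing columns.
df["income_per_person"] = df["income"] / df["household_size"]

print(df)
```

Having a hand-written baseline like this makes it easier to judge whether an automated imputation or a generated feature is reasonable.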
Pandas AI effortlessly blends with machine learning libraries, empowering analysts to construct predictive models and unlock profound data insights. It simplifies the machine learning process by automating model selection, hyperparameter tuning, and evaluation. Analysts can now swiftly test various algorithms, assess their effectiveness, and pinpoint the best model for a specific challenge. The beauty of Pandas AI lies in its accessibility, allowing even non-coders to harness the power of machine learning for data analysis.
With Pandas AI, decision-makers gain the power to explore potential outcomes through simulations. By adjusting data and introducing different factors, this library enables users to investigate “what-if” situations and assess the effects of different strategies. By simulating real-world scenarios, Pandas AI helps make informed decisions and identify the best possible courses of action. It’s like having a crystal ball that guides you toward optimal choices.
Here’s how you can get started with Pandas AI, including some examples and their corresponding output.
Before you start using PandasAI, you need to install it. Open your terminal or command prompt and run the following command.
pip install pandasai
Once you have completed the installation, you’ll need to connect to a powerful language model on the backend, the OpenAI model. To do this, you’ll need to follow these steps.
These steps will allow you to obtain the necessary API key from OpenAI and set up your project notebook to connect with the OpenAI language model.
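One common way to keep the key out of your notebook is to read it from an environment variable. The sketch below uses only the standard library. The helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration, not something PandasAI requires.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption; use whatever name you exported.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before connecting."
        )
    return key
```

You could then pass the result as the `api_token` argument when creating the `OpenAI` object, instead of pasting the key into your code.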
Next, import the following.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Replace the placeholder with your own OpenAI API key
llm = OpenAI(api_token="YOUR_API_TOKEN")
Pass the OpenAI model to Pandas AI using the command below.
pandas_ai = PandasAI(llm)
Run the model on the data frame using two parameters and ask relevant questions.
For example-
pandas_ai.run(df, prompt='the question you would like to ask?')
Now that we have everything in place, let’s start asking questions.
To ask questions using Pandas AI, you can use the “run” method of the PandasAI object. This method requires two inputs: the DataFrame containing your data and a natural language prompt that represents the question or commands you want to execute on your data.
To verify the accuracy of the results, we will compare the outputs from both Pandas and Pandas AI. By observing the code snippets, you can see the outcomes produced by each approach.
You can ask PandasAI to return DataFrame rows with a column’s value greater than a specific value.
For example-
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate a LLM
llm = OpenAI(api_token="YOUR_API_TOKEN")

pandas_ai = PandasAI(llm)
pandas_ai(df, prompt='Which are the 5 happiest countries?')
Output-
6            Canada
7         Australia
1    United Kingdom
3           Germany
0     United States
Name: country, dtype: object
Continuing the above example, to find the sum of the GDPs of the two unhappiest countries, you can run the following code.
For example-
pandas_ai(df, prompt='What is the sum of the GDPs of the 2 unhappiest countries?')
Output- 19012600725504
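To cross-check both answers, the same questions can be answered in plain Pandas. The sketch below rebuilds the sample DataFrame from the example and uses `nlargest` and `nsmallest`. It is one possible way to write the check, not the code that Pandas AI generates.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# The 5 happiest countries: the rows with the largest happiness_index.
top5 = df.nlargest(5, "happiness_index")["country"]
print(top5.tolist())
# ['Canada', 'Australia', 'United Kingdom', 'Germany', 'United States']

# Sum of the GDPs of the 2 unhappiest countries (China and Japan).
bottom2_gdp = df.nsmallest(2, "happiness_index")["gdp"].sum()
print(bottom2_gdp)  # 19012600725504
```

Both results agree with the Pandas AI outputs shown above.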
Visualizing data is essential for understanding patterns and relationships. Pandas AI can perform data visualization tasks, such as creating plots, charts, and graphs. By visualizing data, you can gain insights and make informed decisions about AI modeling and analysis.
For example-
pandas_ai(
    df,
    "Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)
For example-
prompt = "plot the histogram for this dataset"
response = pandas_ai.run(df, prompt=prompt)
print(f"** PANDAS AI: {response}")
PandasAI allows you to pass multiple dataframes and ask questions based on them.
For example-
# Example of using PandasAI on multiple Pandas DataFrames
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}

salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True, conversational=True)
response = pandas_ai([employees_df, salaries_df], "Who gets paid the most?")
print(response)
# Output: Olivia
Code source- GitHub
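To confirm the answer by hand, the two tables can be joined on `EmployeeID` in plain Pandas. This is a sketch of the equivalent manual query, not the code Pandas AI produces.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on their shared key.
merged = employees_df.merge(salaries_df, on="EmployeeID")

# Find the row with the highest salary and read the name from it.
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]
print(top_earner)  # Olivia
```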
To generate the Python code for execution, PandasAI first takes a small portion of the dataframe, scrambles the data (using random numbers for sensitive information and shuffling for non-sensitive information), and sends only that portion to the LLM.
If you want to protect your privacy even more, you can use PandasAI with a setting called enforce_privacy = True. This setting ensures that only the names of the columns are sent to the LLM, without sending any actual data from the data frame.
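The idea behind both privacy levels can be illustrated in plain Pandas and NumPy. The sketch below is not PandasAI's actual implementation. The function names, the `sensitive` argument, and the toy data are all hypothetical and serve only to show the technique described above.

```python
import numpy as np
import pandas as pd


def anonymized_head(df, sensitive, n=5, seed=0):
    """Illustrative only: return a small, scrambled sample of df."""
    rng = np.random.default_rng(seed)
    sample = df.head(n).copy()  # work on a copy so df is left untouched
    for col in sample.columns:
        if col in sensitive:
            # Sensitive column: replace the values with random numbers.
            sample[col] = rng.random(len(sample))
        else:
            # Non-sensitive column: shuffle so rows no longer line up.
            sample[col] = rng.permutation(sample[col].to_numpy())
    return sample


def column_names_only(df):
    """Illustrative only: what a names-only mode would expose."""
    return list(df.columns)


df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "salary": [5000, 6000, 4500, 7000],
})
sample = anonymized_head(df, sensitive={"salary"}, n=3)
print(sample)
print(column_names_only(df))
```

In this sketch the sample keeps the column names and the overall shape of the data, which is enough context to write code against, while the real values in sensitive columns never leave the machine.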
For example-
# Example of using PandasAI with a Pandas DataFrame
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
from .data.sample_dataframe import dataframe

df = pd.DataFrame(dataframe)

llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True, enforce_privacy=True)
response = pandas_ai(
    df,
    "Calculate the sum of the gdp of north american countries",
)
print(response)
# Output: 20901884461056
Code source- GitHub
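The result can be reproduced in plain Pandas. The sample dataframe imported above is not shown in this guide, so the sketch assumes it holds the same country data as the earlier happiness example. The list of North American countries is also an assumption made for illustration.

```python
import pandas as pd

# Assumption: same rows as the earlier sample DataFrame.
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
})

# Assumption: which countries count as North American.
north_america = ["United States", "Canada", "Mexico"]

# Keep only the matching rows and add up their GDPs.
total = df.loc[df["country"].isin(north_america), "gdp"].sum()
print(total)  # 20901884461056
```

Under these assumptions, the total matches the output reported above.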
PaLM 2 is a language model made by Google. It handles advanced reasoning tasks such as understanding code and math, answering questions, translating languages, and generating natural-sounding text. According to Google, it outperforms the company’s previous language models at these tasks, thanks to better technology and improvements in how it learns from data.
To use this model, first get a Google Cloud API key. Once you have the key, create an instance of the GooglePalm object.
Use the example below to call the GooglePalm model.
from pandasai import PandasAI
from pandasai.llm.google_palm import GooglePalm

llm = GooglePalm(google_cloud_api_key="my-google-cloud-api-key")
pandas_ai = PandasAI(llm=llm)
If you want to use the Google PaLM models through the Vertex AI API, you must have the following.
Once everything is set up, you can create an instance of Google PaLM through Vertex AI. Use the example below to call Google Vertex AI.
from pandasai import PandasAI
from pandasai.llm.google_palm import GoogleVertexai

llm = GoogleVertexai(project_id="generative-ai-training",
                     location="us-central1",
                     model="text-bison@001")
pandas_ai = PandasAI(llm=llm)
As with OpenAI, you need an API key to use HuggingFace models.
To use these models, first get the key.
Use the key to instantiate the HuggingFace models. PandasAI supports the following HuggingFace models-
For example-
from pandasai import PandasAI
from pandasai.llm.starcoder import Starcoder
from pandasai.llm.open_assistant import OpenAssistant
from pandasai.llm.falcon import Falcon

llm = Starcoder(huggingface_api_key="my-huggingface-api-key")
# or
llm = OpenAssistant(huggingface_api_key="my-huggingface-api-key")
# or
llm = Falcon(huggingface_api_key="my-huggingface-api-key")

pandas_ai = PandasAI(llm=llm)
from pandasai import PandasAI
from pandasai.llm.starcoder import Starcoder
from pandasai.llm.open_assistant import OpenAssistant
from pandasai.llm.falcon import Falcon

llm = Starcoder()  # no need to pass the API key, it will be read from the environment variable
# or
llm = OpenAssistant()  # no need to pass the API key, it will be read from the environment variable
# or
llm = Falcon()  # no need to pass the API key, it will be read from the environment variable

pandas_ai = PandasAI(llm=llm)
As we delve into Pandas AI and its potential to transform data analysis, it’s crucial to address certain challenges and ethical considerations. Automating data analysis highlights important concerns regarding transparency, accountability, and bias. Analysts need to be cautious when interpreting and validating the results produced by Pandas AI, as they retain the responsibility for critical decision-making based on the insights derived.
Let’s remember that while Pandas AI offers incredible possibilities, human judgment and careful assessment remain indispensable for making informed choices.
Consider potential challenges and exercise caution when relying on Pandas AI for critical decision-making or sensitive data analysis. Consistent evaluation and validation of the generated results help mitigate these challenges and ensure the reliability of the analysis.
PandasAI holds the potential to revolutionize the ever-changing world of data analysis. If you’re a data analyst focused on extracting insights and creating plots based on user needs, this library can automate the process efficiently. However, there are a few challenges to be aware of while using PandasAI.
The results obtained heavily rely on how the AI interprets your instructions, and sometimes it may not give the expected answers. For example, in the Olympics dataset, the AI occasionally got confused between “Olympic games” and “Olympic events,” leading to potentially different responses.
Nevertheless, its advantages in simplifying and streamlining data analysis make it a valuable tool. Its advanced functionality and efficiency make it a useful asset in a data scientist’s toolkit.
Pandas AI is an enhanced version of the Pandas library that applies artificial intelligence (AI) to make data analysis easier and quicker. It performs tasks such as data cleaning automatically and offers smarter insights with better visualizations.
Pandas AI goes beyond standard Pandas by incorporating artificial intelligence into its features. For instance, it automates data cleaning, provides advanced visualizations, offers predictive analytics, and allows you to query data in natural language.
Yes, you can use it alongside other tools such as the traditional Pandas library itself, NumPy, Matplotlib, or Seaborn without any issues.
Pandas AI is beneficial for:
Advantages of Pandas AI over traditional Pandas include: