Guide

How To Build A Lightweight ETL Pipeline in Under An Hour

Zachary Kirby
Co-Founder
Published On
August 28, 2023

Extracting value from your Salesforce CRM data is imperative.

Salesforce does have good reporting capabilities, but by definition your data is siloed. You can look at your sales data, but not how it relates to your marketing analytics or product data. By extracting Salesforce data you can break down these data silos, enabling a unified view of your customer interactions.

That’s what we’re going to do today. Below is a step-by-step guide to setting up an ETL (Extract-Transform-Load) pipeline to extract data from Salesforce, transform it into business metrics, such as sales rep close value and sales cycle length, and load both the raw data and the transformed data into a database.

You can do this in under an hour. Probably a lot under. And by the end you’ll have immediate access to new versions of your Salesforce data for analysis.

What is an ELT/ETL Pipeline?

ETL stands for Extract, Transform, Load: you extract data from the systems where it lives (here, Salesforce), transform it into cleaned-up, aggregated business metrics, and load both the raw and transformed data into a central store for analysis. ELT is the same idea with the last two steps swapped, loading the raw data first and transforming it inside the warehouse.

Let’s go through the process then. Here, we are going to set up a simple ETL pipeline that you can get going in literally minutes. All you need is a Salesforce account with some CRM data in it, a Vessel API token with an authorized Salesforce connection, a Node.js environment to run the extraction and transformation code, and a Postgres database (or another store) to load the results into.
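As a rough setup sketch, the dependencies used in the rest of this guide can be installed in one go (Express is an assumption here; any HTTP framework that exposes the endpoint below will do):

npm install @vesselapi/client-sdk express pg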

The E in ETL: Getting your data out of Salesforce

This is obviously the first step. We want our data out of Salesforce so that we can then a) perform any transformations on it, and b) store it elsewhere in an easier-to-query format.

For this, we have two options: call the Salesforce APIs directly, or use a unified API like Vessel that handles authentication and normalizes the data for us.

We’re going to use Vessel. There’s another benefit to using Vessel here: we can also extract data from other services without having to integrate each service’s API separately. This is a critical feature of ETL pipelines. The idea isn’t to just extract data from one place. It is to extract data from your entire suite of tools, transform it all together, and save that combined data for better insight across your organization.

Check out our docs for how to authorize your Salesforce session. We won’t cover that here and instead dive straight into calling the unified API to grab your Salesforce data. Here, we’ll use the Vessel SDK to list all the deals modified on a given day:

import Vessel from "@vesselapi/client-sdk";

[...]

app.post('/list-deals', async (req, res) => {
  // API_KEY is your Vessel API token; accessToken is the token for the
  // Salesforce connection you authorized (see the docs linked above).
  const vessel = Vessel({
    apiToken: API_KEY,
    accessToken,
  });
  // List every deal modified on the given date via the unified CRM API.
  const response = await vessel.unifications.crm.deals.list({
    filters: {
      modifiedTime: {
        equals: '08/24/23',
      },
    },
  });
  const { result, error } = response.body;
  // Return an error response rather than throwing inside the async handler.
  if (error) return res.status(500).send({ error: error.message });
  res.send({ deals: result });
});

You can call this API endpoint every day to update your list of deals. That’s all you need to extract your data straight out of Salesforce.
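To sanity-check the endpoint, you can call it from any HTTP client or script. Here’s a minimal sketch using Node’s built-in fetch, assuming the app above is running locally on port 3000 (swap in your real host and port):

// Hit the /list-deals endpoint defined above and log how many deals came back.
// Assumes the app is listening on http://localhost:3000.
const listDeals = async () => {
  const response = await fetch('http://localhost:3000/list-deals', { method: 'POST' });
  const { deals } = await response.json();
  console.log(`Fetched ${deals.length} deals`);
  return deals;
};

listDeals();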

The T in ETL: Transforming your data for better insight

Once you have extracted your data, you can start to manipulate it for insights into your business. This is called transformation, but really this part of the workflow can be broken down even further: cleaning the raw records, aggregating them, and deriving the metrics you actually care about.

dbt is becoming the standard tool for transforming data. But it has a steep learning curve, so if you are looking to build an ETL pipeline quickly, it’s not the best option. It is the best option once you have that pipeline in place and need to solidify your transformations.

But you can do quick transformations on your data as it comes from Salesforce using custom code. Here, we’ll use built-in JavaScript methods to show how you can go from extracting the data from Salesforce in JS, to transforming it in JS, to loading it in JS. That said, Python has a richer ecosystem for data transformation and analytics.

Let’s say our Salesforce data has value and rep information for each deal, as well as the start date and the closed date for the deal. We can transform the raw data into aggregated metrics to help understand our sales cycle.

First, let’s calculate the total close value for each sales rep:

// 'data' is returned from the list-deals endpoint above

const totalCloseValueByRep = data.reduce((acc, record) => {
  if (record.close_value) {
    acc[record.sales_rep] = (acc[record.sales_rep] || 0) + record.close_value;
  }
  return acc;
}, {});

console.log(totalCloseValueByRep);

Which could output something like the following (the reps and values here are purely illustrative):
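{ Anna: 1153420, Marcus: 876315, Priya: 641280 }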

We can also do the same thing by product:

const totalCloseValueByProduct = data.reduce((acc, record) => {
  if (record.close_value) {
    acc[record.product] = (acc[record.product] || 0) + record.close_value;
  }
  return acc;
}, {});

console.log(totalCloseValueByProduct);

This outputs a similar object, keyed by product instead of sales rep.

Finally, we can work out the sales cycle for each deal and then calculate the average length of a closed won deal:

// first, work out each sales cycle

data.forEach(record => {
  if (record.created_on && record.close_date) {
    const createdDate = new Date(record.created_on);
    const closeDate = new Date(record.close_date);
    const durationDays = (closeDate - createdDate) / (1000 * 60 * 60 * 24);
    record.duration = durationDays;
  } else {
    record.duration = null;
  }
});

console.log(data);

// then get the average close duration across won deals

const wonOpportunities = data.filter(record => record.deal_stage === 'Won' && record.duration !== null);
const totalDuration = wonOpportunities.reduce((sum, record) => sum + record.duration, 0);
const averageCloseDuration = totalDuration / wonOpportunities.length;

console.log(`Average Close Duration for Won Opportunities: ${averageCloseDuration.toFixed(2)} days`);

Average Close Duration for Won Opportunities: 46.94 days

From the data, it's evident that Anna is closing the most value, and "GTXPro" has the highest total close value, indicating it is the most lucrative product. Additionally, on average, it takes about 47 days to close a successful deal. This information can help sales teams set expectations and prioritize which products to focus on.

We want to keep this data, so the next step is to load it into a database.

The L in ETL: Getting your data into storage

The final step is to keep all of your data in a single place. This can be a data warehouse like Snowflake, but it can also be a regular database like PostgreSQL or even S3 blob storage. When you are only pulling in data from a single source like Salesforce, this might seem like overkill. But when you are unifying data from sales, marketing, product, and engineering, having a single repository for all your data and all your transformations makes it easier to get a holistic view of your business.

Let’s get the data above into a Postgres database.

First, you'll need to install the Node Postgres package, pg:

npm install pg

Then, you can use the following code to save your data:

const { Client } = require('pg');

// Define your database connection settings
const client = new Client({
    user: 'your_username',
    host: 'localhost',
    database: 'your_database_name',
    password: 'your_password',
    port: 5432,
});

// Connect to the database
client.connect();

// Assuming data is your original dataset and totalCloseValueByRep contains the transformed values
const saveData = async () => {
    try {
        // Save the original data (use for...of so each insert is actually awaited)
        for (const record of data) {
            await client.query(
                'INSERT INTO original_data(account, opportunity_id, sales_agent, deal_stage, product, close_date, close_value, created_on, week) '
                + 'VALUES($1, $2, $3, $4, $5, $6, $7, $8, $9)'
                , [record.account, record.opportunity_id, record.sales_agent, record.deal_stage, record.product, record.close_date, record.close_value, record.created_on, record.week]);
        }

        // Save the transformed data (e.g., total close value by rep)
        for (const agent in totalCloseValueByRep) {
            await client.query(
                'INSERT INTO transformed_data_sales_rep(sales_rep, total_close_value) '
                + 'VALUES($1, $2)'
            , [agent, totalCloseValueByRep[agent]]);
        }

        // ... Similarly, you can save other transformations like totalCloseValueByProduct, averageCloseDuration, etc.

    } catch (err) {
        console.error('Error saving data:', err);
    } finally {
        // Close the database connection
        await client.end();
    }
};

// Invoke the function
saveData();
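One assumption in the code above is that the two tables already exist. If they don’t, a rough sketch of the schemas could look like this, run once before saveData (the column types here are guesses based on the fields being inserted, so adjust them to your data):

// Assumed schemas for the tables used above; adjust column types to your data.
const createTables = async () => {
    await client.query(`
        CREATE TABLE IF NOT EXISTS original_data (
            account TEXT,
            opportunity_id TEXT,
            sales_agent TEXT,
            deal_stage TEXT,
            product TEXT,
            close_date DATE,
            close_value NUMERIC,
            created_on DATE,
            week TEXT
        )`);
    await client.query(`
        CREATE TABLE IF NOT EXISTS transformed_data_sales_rep (
            sales_rep TEXT,
            total_close_value NUMERIC
        )`);
};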

Finally, you’ll want to schedule this to run each day (make the modifiedTime a variable that updates each day!) so you can constantly update your transformations and data in your database.
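As a sketch of that, you could compute the date at the top of each run instead of hard-coding '08/24/23' (the MM/DD/YY string here just mirrors the format used in the earlier example, so double-check which formats the modifiedTime filter actually accepts):

// Build today's date in the same MM/DD/YY format used in the filter earlier.
const today = new Date();
const mm = String(today.getMonth() + 1).padStart(2, '0');
const dd = String(today.getDate()).padStart(2, '0');
const yy = String(today.getFullYear()).slice(-2);
const modifiedTime = `${mm}/${dd}/${yy}`;

// Inside the /list-deals handler, use the variable in the filter:
const response = await vessel.unifications.crm.deals.list({
  filters: {
    modifiedTime: { equals: modifiedTime },
  },
});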

To do this, you need to schedule a cron job, which is simply a command that runs on a schedule you define. You can use cloud cron job providers such as Cronhub, AWS Lambda with Scheduled Events, or Google Cloud Scheduler.

But you can also just run this on your local machine. On a Unix-like operating system, open the terminal and type crontab -e to edit the cron jobs for the current user.

Add a new line to schedule your script. For instance, to run a file containing the fetchDeals code above every day at 3 am, you'd add:

0 3 * * * /usr/bin/node /path/to/your/script/fetchDeals.js >> /path/to/your/logfile.log 2>&1

And you have a scheduled ETL pipeline.

Building ETL into your pipeline

With just a little bit of JavaScript, you can get an entire ETL pipeline for your Salesforce data up and running in minutes. If you are using the Vessel unified API, then it will be trivial to add other data from your GTM stack and your product to this pipeline for more and more insights.

Insights are what this pipeline is about. It takes your siloed data and brings it together so you get a holistic view of your operations. By integrating diverse data sources, you unlock a deeper understanding of customer journeys, optimize processes, and drive strategic decision-making.