Building a Budget App

Phase 1: Learning the hard way

Author

Robert Crump

Published

May 1, 2023

GitHub repository

Introduction

My motivation for this project is twofold: I want to keep better track of my spending and budgeting, and I want to improve my coding skills. There are plenty of Excel templates or budgeting apps like Mint that allow people to make and maintain budgets. Personally, I feel constrained by these products because they don’t give me the level of flexibility and control that I’m looking for. At the same time, I want to expand my coding skills and revisit some coding concepts that I didn’t fully grasp while in grad school.

I want to solidify my understanding of user-generated functions, for-loops, and iteration by building a solution from scratch. As I was developing this project, a useful phrase occurred to me about how to shift my thinking about data science techniques. Using “data structures as tools” is a significant step for my knowledge and skills. Specifically, I will be using the list structure in R to create the functionality I want from my budgeting app. This solution is built around the data I get from my bank and suited to my particular needs, but I’m already thinking about how to build a more generalized application to further expand my skills into machine learning.

Another important focus for me is workflow and repository management. I want to get more accustomed to both RStudio Projects and GitHub. I’m also practicing commenting and organizing my code so that it is more accessible. One last ancillary goal is to get practice explaining data concepts to a general audience. I won’t go into extreme detail on every line of code, but I will try to draw out the major concepts to explain my thought process as I worked towards the end goal.

Compile Data

As with any other data project, the first step is simply to look at the data to see how it’s structured and what information is contained within it. Below is a sample of transaction data from my bank. I’ve altered all the information within each cell for privacy, but kept the values similar enough to behave the same way in the code.

Posted Date Reference Number Payee Address Amount
4/6/2022 2.469216e+22 SQ *PLEIN AIR CAFE Chicago IL Chicago IL -5.98
4/5/2022 2.413746e+22 CVS/PHARMACY #05836 CHICAGO IL CHICAGO IL -23.03
4/4/2022 2.449398e+22 TRADER JOE’S #706 QPS CHICAGO IL CHICAGO IL -23.61
4/4/2022 2.413746e+22 WHOLEFDS HDP#10458 CHICAGO IL CHICAGO IL -22.55
4/4/2022 2.469216e+22 COMCAST CHICAGO 800-COMCAST IL 800-COMCAST IL -34.26

The columns included in my credit card statements are Posted Date, Reference Number, Payee, Address, and Amount. I funnel as many expenses as possible through my credit card so that I can generate cash-back rewards over time. This also simplifies my expense tracking to a single output point. Expenses like rent and others that come directly from my checking account are not recorded for this exercise.

After I download all the monthly statements from my bank’s website, I collect each file into a specific folder within my repository. I am using GitHub to manage my version control and to publish this website (for more information on GitHub and repositories click here). Using folders and logical directory structures makes referencing files in code much easier and enables the iterative list functionality mentioned before. For example, the code below combines monthly statements into a single file.

Code
list.files(path = "data/CC_stmt_raw/", 
           pattern = "*.csv",
           full.names = TRUE) %>% 
lapply(read_csv) %>% 
bind_rows()

The list.files function produces a character vector of all the files in a location specified by the path argument that also match the text in the pattern argument. Thanks to RStudio Projects, I only need to specify the top-level folder within the project folder instead of a full Windows drive directory. A logical file directory structure allows me to easily reference different folders that contain files needed for an operation. In this case, I’m referencing a subfolder called CC_stmt_raw within the top-level data folder that contains all my .csv files of credit card transactions for different months. The output of the first function looks like this:

Code
list.files(path = "data/CC_stmt_raw/", 
           pattern = "*.csv")
 [1] "April2022.csv"     "August2022.csv"    "December2022.csv" 
 [4] "February2022.csv"  "February2023.csv"  "January2022.csv"  
 [7] "January2023.csv"   "July2022.csv"      "June2022.csv"     
[10] "March2022.csv"     "March2023.csv"     "May2022.csv"      
[13] "November2022.csv"  "October2022.csv"   "September2022.csv"

Each element in the vector ends with .csv. Note that the pattern argument is interpreted as a regular expression, not a shell wildcard: “*.csv” happens to work here, but the precise pattern would be “\\.csv$”, which matches a literal .csv at the end of a filename.
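As a quick check of that pattern behavior (the filenames below are made up for illustration), the stricter regular expression matches only names that actually end in .csv:

```r
# toy filenames to demonstrate the pattern matching
filenames <- c("April2022.csv", "notes.txt", "archive.csv.bak")

# "\\.csv$" matches a literal ".csv" at the end of the string
grepl("\\.csv$", filenames)
#> [1]  TRUE FALSE FALSE
```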

The next two functions in the sequence use the character vector to read the contents of each file and combine them into a single dataframe. lapply applies a function across a vector; in this case, read_csv is applied to all file paths in the character vector created by list.files. The output is a list of 15 dataframes, one for each file/month in the subfolder. bind_rows then combines them into one large dataframe by appending rows vertically.

Code
lapply(read_csv) %>% 
bind_rows()
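To see how those last two steps behave without touching the real files, here’s a minimal sketch with made-up dataframes standing in for the output of read_csv:

```r
library(dplyr)

# made-up stand-ins for two monthly statements
# (normally produced by applying read_csv over the file paths)
months <- list(
  data.frame(date = c("4/6/2022", "4/5/2022"), amount = c(-5.98, -23.03)),
  data.frame(date = "5/2/2022", amount = -12.50)
)

# bind_rows() appends the dataframes vertically into one
combined <- bind_rows(months)
nrow(combined)
#> [1] 3
```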

Next, I do some basic data wrangling to organize the dataframe the way I want, and assign it to a named object, CC_stmt_allmonths. All together, the code sequence or “pipe” looks like this:

Code
CC_stmt_allmonths <-
  list.files(path = "data/CC_stmt_raw/", 
             pattern = "*.csv",
             full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows() %>% 
  
  # data wrangling
  select(-Address) %>% 
  rename("date" = `Posted Date`,
         "id" = `Reference Number`,
         "source" = Payee,
         "amount" = Amount) %>%
  
  # convert to proper date format
  mutate(date = mdy(date)) %>%
  
  # remove payments made to CC balance
  filter(!str_detect(source, "PAYMENT"),
         
         # adjust date range
         date >= as.Date("2022-01-01")) %>%
  arrange(date)

First Attempt

Now that I have all my monthly statements compiled into a single dataframe, I need to label each transaction. This is where things started to go sideways for me. Ultimately, this is a lesson in doing things the hard way to inspire learning to code efficiently. The problem with this dataset is that it is information-poor despite being fairly long.

nrow counts the number of rows in a dataframe:

Code
nrow(CC_stmt_allmonths)
[1] 1094

names shows the column names in a dataframe:

Code
names(CC_stmt_allmonths)
[1] "date"   "id"     "source" "amount"

My goal is not necessarily to reduce work at this stage in development. I am aiming for control and accuracy in how transactions are categorized. The most direct and precise way to do this is by manually assigning values to each row in Excel. In my case, it would take 1094 separate entries across two columns to start, plus labeling new transactions as they come in each month. That method might help me keep better track of my spending, but it would eliminate the coding challenge and… it’s a complete pain in the ass.

My first idea was to look for recurring patterns in source, which represents the vendor that received the payment. Three patterns jumped out at me immediately. Transactions made using the Square payment system have “SQ” in the vendor name, and payments handled by Toast show up as “TST”. Amazon transactions appear in different formats but all contain “amzn.com”. To apply the same label to all of these transactions, I used the stringr and dplyr packages to create a new category variable based on source.

Code
CC_stmt_allmonths %>% 
  mutate(category = case_when(
    str_detect(source, fixed("sq", ignore_case = T)) ~ "Food",
    str_detect(source, fixed("tst", ignore_case = T)) ~ "Food",
    str_detect(source, fixed("amzn.com", ignore_case = T)) ~ "Amazon",
    TRUE ~ "other")
  ) %>% 
  filter(category != "other") %>% 
  nrow()
[1] 410

dplyr supplies the mutate function that generates a new variable and the case_when function that specifies a condition. In this example, I’m using str_detect from the stringr package to match a string (either “sq”, “tst”, or “amzn.com”) to determine the category label. I was initially very encouraged by this development because it categorized 410 out of my original 1094 transactions with just a few lines of code. If other patterns could help to identify large groups of transactions, I could continue reducing the number of unlabeled transactions until none remained.
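A quick illustration of how fixed() with ignore_case behaves (the vendor strings below are invented):

```r
library(stringr)

# invented vendor strings for illustration
vendors <- c("SQ *PLEIN AIR CAFE", "TST* SOME RESTAURANT", "AMZN Mktp US")

# fixed() treats "sq" as a literal string rather than a regular
# expression; ignore_case = TRUE lets it match the uppercase "SQ"
str_detect(vendors, fixed("sq", ignore_case = TRUE))
#> [1]  TRUE FALSE FALSE
```

Note that fixed() matches the literal substring anywhere in the string, which is why it catches every Square vendor regardless of what follows the “SQ” prefix.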

Unfortunately, the inconsistency and proliferation of vendor names meant that I had to produce over two hundred string detection rules to label each transaction with a category and subcategory. As I was typing and pasting all these lines, I kept making clerical errors while getting frustrated and fatigued. I knew there had to be a better way, but convinced myself to use a temporary solution while I figured that out.

Ultimately, I’m glad that I went through the frustration of this misguided process because it gave me the experience of coding an iteration by hand. Adding a new line to label each transaction’s category and another line to label each transaction’s subcategory was labor-intensive, difficult to organize, and error-prone. Exactly what I was trying to avoid. I decided to abandon this method almost the moment I completed it.

Taking a Step Back

To gain accuracy and any amount of efficiency, I had to take a different approach. I knew that eventually I wanted to employ some sort of looping function to label transactions, but wasn’t very comfortable with the technique and didn’t know how to implement it. The experience of writing over 200 case_when rules convinced me that I had to figure it out now. Plus, it would give me the opportunity to use a personal project to make progress on mastering a concept that had been evading me for years.

The script I ended up using employs a similar technique to read and compile the .csv files for each month’s expenses. I was able to plug the lapply function into a sequence of other functions because list.files and read_csv perform specific predefined operations. After a few days of experimenting, googling, and even using rtutor.ai, I realized the best solution for me was to preprocess the data manually using Excel then design my code around the initial setup.

I resisted doing anything in Excel because it seemed to defeat the purpose of learning to code by taking on new data science challenges. I also wanted to avoid designing a process that required more than one program. The idea here is to replicate expense trackers that automatically label transactions, not take detours into Excel midway through a process. When I took a step back and thought about what I was trying to accomplish and what a solution would entail, I had another important realization about the flaw in my approach.

By narrowly focusing on one tool, I limited my options. With Excel, I can place each transaction in a sheet named for its subcategory. Using the readxl package in R, I can extract the sheet names as a character vector, then use a for-loop to iteratively label each transaction according to its position among the preprocessed Excel sheets. I also realized that I didn’t have to think linearly by assigning the primary category first. Since each primary category is a collection of subcategories, I can easily map them using a simple character vector with case_when.

Before deciding on Excel, I assumed it would be another long, tedious process. But Excel has sorting functions and hotkeys that make navigating between sheets while moving cells around go pretty quickly. Additionally, preprocessing enables other techniques that will pay off later. Using R to accomplish the same thing was much less effective and wildly less efficient.

Coding the Solution

Before I can code the solution I have in mind, I have to preprocess my data. Compiling the data combines each month’s transactions into a single dataframe 1094 rows long. Instead of manipulating every transaction by row, I only need the unique values in the source column. I can grab those and create a new Excel file using the code below:

Code
# save unique values in new csv
write_csv(as_tibble(unique(CC_stmt_allmonths$source)),"data/expense_sources.csv")

This step reduces the 1094 transactions down to 323 unique vendor names to preprocess. Those 323 unique values will act as a key for assigning categories to the full dataset. Keep in mind that the variable source refers to the source of the transaction, that is, the vendor name like a gas station, bookstore, or internet subscription.

Next, I opened that file and a blank .xlsx file in Excel. In the new .xlsx file, I added 12 sheets and labeled each corresponding to the subcategory I want to use. I then used Excel’s sorting features and hotkeys to copy vendor names from the full dataset into smaller groups broken out across the 12 sheets, and voila, preprocessing is complete! This took all of about an hour once I figured out the hotkeys to simplify the work, and most of that time was spent deciding what labels to use and where to place various transactions. The previous method in R took several hours across multiple days, and it still didn’t achieve a satisfactory result.

The subcategories I chose are: (“housing”, “utilities”, “transportation”, “household”, “health”, “groceries”, “eating_out”, “amazon”, “entertainment”, “technology”, “gift”, “misc”). With readxl::excel_sheets(), I can easily extract sheet names from the Excel file, which I saved in my data folder as expense_sources_bysubcat. If I want to change my subcategories, I just need to alter the sheet labels in the newly created Excel file, and I can simply add new entries each month through the same manual preprocessing step. For now, this is a workable solution that fits neatly into my workflow. I’ll also write three character vectors to define the primary categories at this stage.

Code
# create file path string
path <- "data/expense_sources_bysubcat.xlsx"

# grab sheet / subcategory names
subcats_vector <- excel_sheets(path)
subcats_vector <- trimws(subcats_vector)
subcats_vector
 [1] "housing"        "utilities"      "transportation" "household"     
 [5] "health"         "groceries"      "eating_out"     "amazon"        
 [9] "entertainment"  "technology"     "gift"           "misc"          
Code
# preliminary step, assigning subcategories to primary categories
upkeep <- c("housing", "transportation", "utilities", "health", "household")
food <- c("groceries", "eating_out")
discretionary <- c("entertainment", "gift")

Next, I convert expense_sources_bysubcat to an R list object, again using lapply, this time with a user-defined function wrapped around readxl::read_excel. The code below uses subcats_vector to pull the different sheets from the Excel file defined by path into a list.

Code
# extract content from excel sheets (expense sources / vendor names)
# place in list structure
expense_source_list <- lapply(subcats_vector, function(x){
  read_excel(path, sheet = x)
})

# apply subcategory names to list positions
names(expense_source_list) <- subcats_vector

Now, I have a list structured in such a way that I can iterate over it with a for-loop to build a subcategory key. Each list element contains all the unique vendor names that correspond to a particular subcategory, as assigned by the names function in the previous code block.
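The shape of that list can be sketched with a couple of made-up entries (vendor names and subcategories here are hypothetical stand-ins):

```r
# a scaled-down, hypothetical version of the expense source list:
# each element holds vendor names, and the element's name is the subcategory
toy_list <- list(
  groceries  = data.frame(value = c("TRADER JOE'S", "WHOLEFDS")),
  eating_out = data.frame(value = "SQ *PLEIN AIR CAFE")
)

# the element names supply the subcategory labels for the loop
names(toy_list)
#> [1] "groceries"  "eating_out"
```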

The code below builds a dataframe by iterating over the list elements (unique vendor names) and the list element names (“housing”, “groceries”, etc.) to assign subcategory labels according to where the transaction is located in the list. i indicates the index of the list elements over which the loop iterates.

Code
# create empty data frame
source_to_subcat <- data.frame()

# iterate over list elements and names to build dataset
for (i in seq_along(expense_source_list)){
  
  # extract information from list structure
  x <- data.frame(
                  # unique vendor names from list element
                  expense_source_list[i],
                  
                  # subcategory from list name
                  names(expense_source_list)[i])
  
  # specify column names
  names(x) <- c("source", "subcategory")
   
  # vertically append rows
  source_to_subcat <- bind_rows(source_to_subcat, x)
}

kable(head(source_to_subcat, 10))
source subcategory
LEMONADE I* LEMONADE I LEMONADE.COM NY housing
MOVINGHELP.COM 800-789-3638 AZ housing
COMCAST CHICAGO 800-COMCAST IL utilities
7-ELEVEN 35808 AUSTIN TX transportation
BYT* CMTA AUSTIN TX transportation
CAP METRO 5124741200 TX transportation
CAP METRO AUSTIN TX transportation
CAPITAL METROPOLITAN T 512-389-7564 TX transportation
CHICAGO O’HARE INTERNA CHICAGO IL transportation
ENTERPRISE RENT-A-CAR AUSTIN TX transportation

Now that each unique vendor name has a subcategory attached, I can match a primary category in another column. I achieve this by revising the case_when solution in a more constrained and reasonable way.

Code
# match subcategory to primary category
source_to_primecat <- source_to_subcat %>% 
  mutate(primecategory = case_when(subcategory %in% upkeep ~ "upkeep",
                                   subcategory %in% food ~ "food",
                                   subcategory %in% discretionary ~ "discretionary",
                                   
                                   # everything else goes into `other`
                                   TRUE ~ "other"))

kable(head(source_to_primecat, 10))
source subcategory primecategory
LEMONADE I* LEMONADE I LEMONADE.COM NY housing upkeep
MOVINGHELP.COM 800-789-3638 AZ housing upkeep
COMCAST CHICAGO 800-COMCAST IL utilities upkeep
7-ELEVEN 35808 AUSTIN TX transportation upkeep
BYT* CMTA AUSTIN TX transportation upkeep
CAP METRO 5124741200 TX transportation upkeep
CAP METRO AUSTIN TX transportation upkeep
CAPITAL METROPOLITAN T 512-389-7564 TX transportation upkeep
CHICAGO O’HARE INTERNA CHICAGO IL transportation upkeep
ENTERPRISE RENT-A-CAR AUSTIN TX transportation upkeep

Finally, I use the source_to_primecat key of unique vendor names and category labels with left_join to apply labels to the full dataset.

Code
# merge category key with full dataset
expenses_labeled <- CC_stmt_allmonths %>% 
  left_join(source_to_primecat, by = "source")

kable(head(expenses_labeled, 10))
date id source amount subcategory primecategory
2022-01-03 2.469216e+22 COMCAST CHICAGO 800-COMCAST IL -11.26 utilities upkeep
2022-01-03 2.449216e+22 EVENT* TIM AND ERIC BR VIEWSTUB.COM FL -11.95 entertainment discretionary
2022-01-03 2.469216e+22 SQ *TRUENORTH CAFE HYDE PChicago IL -5.13 eating_out food
2022-01-04 2.449398e+22 TRADER JOE’S #706 QPS CHICAGO IL -30.17 groceries food
2022-01-04 2.444500e+22 METRA MOBILE CHICAGO IL -4.00 transportation upkeep
2022-01-04 2.416407e+22 FEDEX OFFIC36000036095 CHICAGO IL -4.54 household upkeep
2022-01-04 2.469216e+22 FARMER’S FRIDGE CHICAGO IL -5.47 eating_out food
2022-01-04 2.469216e+22 SQ *TRUENORTH CAFE HYDE PChicago IL -5.80 eating_out food
2022-01-04 2.469216e+22 AMZN Mktp US*FC9T38QM3 Amzn.com/billWA -17.63 amazon other
2022-01-05 2.413746e+22 CVS/PHARMACY #05836 CHICAGO IL -17.67 household upkeep

Reflection

Nothing is quite so edifying as building a tool to resolve a personal need. Not only is the end product practical, it’s also built around my own real-world data and empowers me to better track my spending. Crafting the solution gave me an opportunity to overcome an obstacle in my own coding comprehension and identify a flaw in my problem-solving approach.

Solving the core problem of this exercise required me to expand my thinking about data structures. They aren’t just objects to be managed and manipulated, they can be used as tools to build other data structures. This project gave me a much stronger understanding of a fundamental data concept. I now feel confident that I can design a data structure to perform a task through iteration instead of solely relying on existing functions. Much like an assembly line is composed of discrete actions that result in a fully built product, a for-loop can use a designed data structure like a recipe to assemble a dataframe.

The next phase I have in mind for this project is to train a machine learning model to eliminate the manual preprocessing step. I have some exposure to machine learning, but I’ve never built an application. The real payoff of the manual preprocessing step is that I now have a fully realized training dataset. I am excited to continue learning while getting some practical experience in a critically important field. My plan is to use this textbook and/or tidymodels to educate myself. Until then, I am very happy with my progress and am looking forward to phase two.