Building a Budget App
Phase 1: Learning the hard way
Introduction
My motivation for this project is twofold: I want to keep better track of my spending and budgeting, and I want to improve my coding skills. There are plenty of Excel templates or budgeting apps like Mint that allow people to make and maintain budgets. Personally, I feel constrained by these products because they don’t give me the level of flexibility and control that I’m looking for. At the same time, I want to expand my coding skills and revisit some coding concepts that I didn’t fully grasp while in grad school.
I want to solidify my understanding of user-generated functions, for-loops, and iteration by building a solution from scratch. As I was developing this project, a useful phrase occurred to me that shifted my thinking about data science techniques: using “data structures as tools” is a significant step for my knowledge and skills. Specifically, I will be using the list structure in R to create the functionality I want from my budgeting app. This solution is built around the data I get from my bank and suited to my particular needs, but I’m already thinking about how to build a more generalized application to further expand my skills into machine learning.
Another important focus for me is workflow and repository management. I want to get more accustomed to both RStudio Projects and GitHub. I’m also practicing commenting and organizing my code so that it is more accessible. One last ancillary goal is to get practice explaining data concepts to a general audience. I won’t go into extreme detail on every line of code, but I will try to draw out the major concepts to explain my thought process as I worked towards the end goal.
Compile Data
As with any other data project, the first step is simply to look at the data to see how it’s structured and what information it contains. Below is a sample of transaction data from my bank. I’ve altered the information within each cell for privacy, but kept it similar enough to behave the same way in the code that follows.

Posted Date | Reference Number | Payee | Address | Amount |
---|---|---|---|---|
4/6/2022 | 2.469216e+22 | SQ *PLEIN AIR CAFE Chicago IL | Chicago IL | -5.98 |
4/5/2022 | 2.413746e+22 | CVS/PHARMACY #05836 CHICAGO IL | CHICAGO IL | -23.03 |
4/4/2022 | 2.449398e+22 | TRADER JOE’S #706 QPS CHICAGO IL | CHICAGO IL | -23.61 |
4/4/2022 | 2.413746e+22 | WHOLEFDS HDP#10458 CHICAGO IL | CHICAGO IL | -22.55 |
4/4/2022 | 2.469216e+22 | COMCAST CHICAGO 800-COMCAST IL | 800-COMCAST IL | -34.26 |
The columns included in my credit card statements are Posted Date, Reference Number, Payee, Address, and Amount. I funnel as many expenses as possible through my credit card so that I can generate cash-back rewards over time. This also simplifies my expense tracking to a single output point. Expenses like rent and others that come directly from my checking account are not recorded for this exercise.
After I download all the monthly statements from my bank’s website, I collect each file into a specific folder within my repository. I am using GitHub to manage my version control and to publish this website (for more information on GitHub and repositories click here). Using folders and logical directory structures makes referencing files in code much easier and enables the iterative list functionality mentioned before. For example, the code below combines monthly statements into a single file.
Code
list.files(path = "data/CC_stmt_raw/",
           pattern = "*.csv",
           full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
The list.files function produces a character vector of all the files in the location specified by the path argument that also match the text in the pattern argument. Thanks to RStudio Projects, I only need to specify the top-level folder within the project folder instead of a full Windows drive directory. A logical file directory structure allows me to easily reference different folders that contain files needed for an operation. In this case, I’m referencing a subfolder called CC_stmt_raw within the top-level data folder that contains all my .csv files of credit card transactions for different months. The output of the first function looks like this:
Code
list.files(path = "data/CC_stmt_raw/",
pattern = "*.csv")
[1] "April2022.csv" "August2022.csv" "December2022.csv"
[4] "February2022.csv" "February2023.csv" "January2022.csv"
[7] "January2023.csv" "July2022.csv" "June2022.csv"
[10] "March2022.csv" "March2023.csv" "May2022.csv"
[13] "November2022.csv" "October2022.csv" "September2022.csv"
Each element in the vector ends with .csv. I had been reading the * before .csv as a glob-style wildcard for “any characters,” but the pattern argument actually takes a regular expression, where that idea is written .* ; the pattern still works here because .csv matches as a substring of each file name.
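For precision, the anchored regular-expression form would look like this (a sketch, pointing at the same folder as above):
Code
# "\\.csv$" is the anchored regex: match only file names ending in .csv
list.files(path = "data/CC_stmt_raw/",
           pattern = "\\.csv$")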
The next two functions in the sequence use the character vector to read the contents of each file and combine them into a single dataframe. lapply applies a function across a vector; in this case, read_csv is applied to all the file paths in the character vector created by list.files. The output of this step is a list of 15 dataframes, corresponding to the number of files/months in the subfolder. bind_rows then combines the dataframes into one large dataframe by appending rows vertically.
Code
# continuing the pipe: read each file, then stack the resulting dataframes
lapply(read_csv) %>%
  bind_rows()
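To make the intermediate structure concrete, here is a sketch that stops the pipe after lapply and inspects the result; the object name stmt_list is hypothetical and not part of the final pipeline:
Code
# assumes the same packages loaded as above
# stmt_list is a hypothetical name, used only for inspection
stmt_list <- list.files(path = "data/CC_stmt_raw/",
                        pattern = "*.csv",
                        full.names = TRUE) %>%
  lapply(read_csv)

length(stmt_list)  # should be 15: one dataframe per monthly statement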
Next, I do some basic data wrangling to organize the dataframe the way I want, and assign it to a named object, CC_stmt_allmonths. Altogether, the code sequence, or “pipe,” looks like this:
Code
CC_stmt_allmonths <- list.files(path = "data/CC_stmt_raw/",
                                pattern = "*.csv",
                                full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows() %>%
  # data wrangling
  select(-Address) %>%
  rename("date" = `Posted Date`,
         "id" = `Reference Number`,
         "source" = Payee,
         "amount" = Amount) %>%
  # convert to proper date format
  mutate(date = mdy(date)) %>%
  # remove payments made to CC balance
  filter(!str_detect(source, "PAYMENT"),
         # adjust date range
         date >= as.Date("2022-01-01")) %>%
  arrange(date)
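A couple of quick sanity checks on the result (a sketch; glimpse comes from dplyr):
Code
# confirm column names and types after the rename and mutate steps
glimpse(CC_stmt_allmonths)

# confirm the date filter: the earliest date should be on or after 2022-01-01
range(CC_stmt_allmonths$date)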
First Attempt
Now that I have all my monthly statements compiled into a single dataframe, I need to label each transaction. This is where things started to go sideways for me. Ultimately, this is a lesson in doing things the hard way, which inspired me to learn how to code efficiently. The problem I’m facing with this dataset is that it is information-poor despite being somewhat long.
nrow counts the number of rows in a dataframe:
Code
nrow(CC_stmt_allmonths)
[1] 1094
names shows the column names in a dataframe:
Code
names(CC_stmt_allmonths)
[1] "date" "id" "source" "amount"
My goal is not necessarily to reduce work at this stage in development. I am aiming for control and accuracy in how transactions are categorized. The most direct and precise way to do this is by manually assigning values to each row in Excel. In my case, it would take 1094 separate entries across two columns to start, plus labeling new transactions as they come in each month. That method might help me keep better track of my spending, but it would eliminate the coding challenge and… it’s a complete pain in the ass.
My first idea was to look for recurring patterns in source, which represents the vendor that received the payment. Three patterns jumped out at me immediately. Transactions made using the Square payment system have “SQ” in the vendor name, and payments handled by Toast show up as “TST”. Amazon transactions appear in different formats but all contain “amzn.com”. To apply the same label to all of these transactions, I used the stringr and dplyr packages to create a new category variable based on source.
Code
CC_stmt_allmonths %>%
  mutate(category = case_when(
    str_detect(source, fixed("sq", ignore_case = T)) ~ "Food",
    str_detect(source, fixed("tst", ignore_case = T)) ~ "Food",
    str_detect(source, fixed("amzn.com", ignore_case = T)) ~ "Amazon",
    TRUE ~ "other")) %>%
  filter(category != "other") %>%
  nrow()
[1] 410
dplyr supplies the mutate function that generates a new variable and the case_when function that specifies a condition. In this example, I’m using str_detect from the stringr package to match a string (either “sq”, “tst”, or “amzn.com”) to determine the category label. I was initially very encouraged by this development because it categorized 410 out of my original 1094 transactions with just a few lines of code. If other patterns could help to identify large groups of transactions, I could continue reducing the number of unlabeled transactions until none remained.
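For anyone unfamiliar with stringr, here is a minimal, self-contained illustration of that matching behavior, using two vendor strings from the sample statement above:
Code
library(stringr)

# fixed() treats the pattern as a literal string rather than a regular expression;
# ignore_case = TRUE makes the match case-insensitive
vendors <- c("SQ *PLEIN AIR CAFE Chicago IL",
             "CVS/PHARMACY #05836 CHICAGO IL")
str_detect(vendors, fixed("sq", ignore_case = TRUE))
# TRUE FALSE: only the Square transaction contains "sq"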
Unfortunately, the inconsistency and proliferation of vendor names meant that I had to produce over two hundred string detection rules to label each transaction with a category and subcategory. As I was typing and pasting all these lines, I kept making clerical errors while getting frustrated and fatigued. I knew there had to be a better way, but convinced myself to use a temporary solution while I figured that out.
Ultimately, I’m glad that I went through the frustration of this misguided process because it gave me the experience of coding an iteration by hand. Adding a new line to label each transaction’s category and another to label each transaction’s subcategory was labor-intensive, difficult to organize, and error-prone. Exactly what I was trying to avoid. I decided to abandon this method almost the moment I completed it.
Taking a Step Back
To gain accuracy and any amount of efficiency, I had to take a different approach. I knew that eventually I wanted to employ some sort of looping function to label transactions, but I wasn’t very comfortable with the technique and didn’t know how to implement it. The experience of writing over 200 case_when rules convinced me that I had to figure it out now. Plus, it would give me the opportunity to use a personal project to make progress on mastering a concept that had been eluding me for years.
The script I ended up using employs a technique similar to the one that reads and compiles the .csv files for each month’s expenses. I was able to plug the lapply function into a sequence of other functions because list.files and read_csv perform specific, predefined operations. After a few days of experimenting, googling, and even using rtutor.ai, I realized the best solution for me was to preprocess the data manually in Excel, then design my code around that initial setup.
I resisted doing anything in Excel because it seemed to defeat the purpose of learning to code by taking on new data science challenges. I also wanted to avoid designing a process that required more than one program. The idea here is to replicate expense trackers that automatically label transactions, not take detours into Excel midway through a process. When I took a step back and thought about what I was trying to accomplish and what a solution would entail, I had another important realization about the flaw in my approach.
By narrowly focusing on one tool, I limited my options. With Excel, I can place each transaction in a sheet named for its subcategory. Using the readxl package in R, I can extract the sheet names as a character vector, then use a for-loop to iteratively label each transaction according to its position among the preprocessed Excel sheets. I also realized that I didn’t have to think linearly by assigning the primary category first. Since each primary category is a collection of subcategories, I can easily map them together using a simple character vector with case_when.
Before deciding on Excel, I thought preprocessing would be another long, tedious process. But Excel has sorting functions and hotkeys that make navigating between sheets while moving data cells around go pretty quickly. Additionally, preprocessing enables other techniques that will pay off later. Using R to accomplish the same thing was much less effective and wildly less efficient.
Coding the Solution
Before I can code the solution I have in mind, I have to preprocess my data. Compiling the data combines each month’s transactions into a single dataframe 1094 rows long. Instead of manipulating every transaction by row, I only need the unique values in the source column. I can grab those and create a new Excel file using the code below:
Code
# save unique values in new csv
write_csv(as_tibble(unique(CC_stmt_allmonths$source)), "data/expense_sources.csv")
This step reduces the 1094 transactions down to 323 unique vendor names to preprocess. Those 323 values will act as a key for assigning categories to the full dataset. Keep in mind that the variable source refers to the source of the transaction, that is, the vendor name, like a gas station, bookstore, or internet subscription.
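A one-line check confirms the size of the key (sketch):
Code
# count the distinct vendor names in the compiled dataframe
length(unique(CC_stmt_allmonths$source))  # 323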
Next, I opened that file along with a blank .xlsx file in Excel. In the new .xlsx file, I added 12 sheets and labeled each one for the subcategory I want to use. I then used Excel’s sorting features and hotkeys to copy vendor names from the full dataset into smaller groups broken out across the 12 sheets, and voila, preprocessing complete! This took all of about an hour once I figured out the hotkeys to simplify the work, and most of that time was spent deciding what labels to use and where to place various transactions. The previous method in R took several hours across multiple days, and it still didn’t achieve a satisfactory result.
The subcategories I chose are: (“housing”, “utilities”, “transportation”, “household”, “health”, “groceries”, “eating_out”, “amazon”, “entertainment”, “technology”, “gift”, “misc”). With readxl::excel_sheets(), I can easily extract sheet names from the Excel file, which I saved in my data folder as expense_sources_bysubcat. If I want to change my subcategories, I just alter the sheet labels in the Excel file, and I can add new entries each month through the same manual preprocessing step. For now, this is a workable solution that fits neatly into my workflow. I’ll also write three character vectors to define the primary categories at this stage.
Code
# create file path string
path <- "data/expense_sources_bysubcat.xlsx"

# grab sheet / subcategory names
subcats_vector <- excel_sheets(path)
trimws(subcats_vector)
[1] "housing" "utilities" "transportation" "household"
[5] "health" "groceries" "eating_out" "amazon"
[9] "entertainment" "technology" "gift" "misc"
Code
# preliminary step, assigning subcategories to primary categories
upkeep <- c("housing", "transportation", "utilities", "health", "household")
food <- c("groceries", "eating_out")
discretionary <- c("entertainment", "gift")
Next, I convert expense_sources_bysubcat to an R list object, again using lapply, this time with a user-defined function wrapped around readxl::read_excel. The code below uses subcats_vector to pull the different sheets from the Excel file defined by path into a list.
Code
# extract content from excel sheets (expense sources / vendor names)
# place in list structure
expense_source_list <- lapply(subcats_vector, function(x){
  read_excel(path, sheet = x)
})

# apply subcategory names to list positions
names(expense_source_list) <- strsplit(subcats_vector, ",")
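At this point the list should have one named element per sheet. A structure check like the following would confirm it (a sketch of the shape, not verbatim output):
Code
# expect one tibble of vendor names per subcategory sheet
str(expense_source_list, max.level = 1)
# List of 12
#  $ housing  : tibble of vendor names
#  $ utilities: tibble of vendor names
#  ... and so on through misc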
Now, I have a list structured in such a way that I can iterate over it with a for-loop to build a subcategory key. Each list element contains all the unique vendor names that correspond to a particular subcategory, as specified by the names function in the previous code block.
The code below builds a dataframe by iterating over the list elements (unique vendor names) and the list element names (“housing”, “groceries”, etc.) to assign subcategory labels according to where each vendor is located in the list. i indicates the index of the list elements over which the loop iterates.
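One subtlety in the loop below is single- versus double-bracket indexing. This is standard R list semantics rather than anything specific to this project, but it explains why the loop can pass expense_source_list[i] straight into data.frame:
Code
# expense_source_list[i] subsets the list: a one-element list that keeps its name
# expense_source_list[[i]] extracts the element itself: the tibble of vendor names
expense_source_list[1]    # $housing followed by its tibble
expense_source_list[[1]]  # the tibble alone, without the wrapping list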
Code
# create empty data frame
source_to_subcat <- data.frame()

# iterate over list elements and names to build dataset
for (i in 1:length(expense_source_list)){
  # extract information from list structure
  x <- data.frame(
    # unique vendor names from list element
    expense_source_list[i],
    # subcategory from list name
    names(expense_source_list)[i])
  # specify column names and restrict range to two columns
  names(x) <- c("source", "subcategory")
  # vertically append rows
  source_to_subcat <- bind_rows(source_to_subcat, x)
}

kable(head(source_to_subcat, 10))
source | subcategory |
---|---|
LEMONADE I* LEMONADE I LEMONADE.COM NY | housing |
MOVINGHELP.COM 800-789-3638 AZ | housing |
COMCAST CHICAGO 800-COMCAST IL | utilities |
7-ELEVEN 35808 AUSTIN TX | transportation |
BYT* CMTA AUSTIN TX | transportation |
CAP METRO 5124741200 TX | transportation |
CAP METRO AUSTIN TX | transportation |
CAPITAL METROPOLITAN T 512-389-7564 TX | transportation |
CHICAGO O’HARE INTERNA CHICAGO IL | transportation |
ENTERPRISE RENT-A-CAR AUSTIN TX | transportation |
Now that each unique vendor name has a subcategory attached, I can match a primary category in another column. I achieve this by revising the case_when solution in a more constrained and reasonable way.
Code
# match subcategory to primary category
source_to_primecat <- source_to_subcat %>%
  mutate(primecategory = case_when(subcategory %in% upkeep ~ "upkeep",
                                   subcategory %in% food ~ "food",
                                   subcategory %in% discretionary ~ "discretionary",
                                   # everything else goes into `other`
                                   TRUE ~ "other"))
kable(head(source_to_primecat, 10))
source | subcategory | primecategory |
---|---|---|
LEMONADE I* LEMONADE I LEMONADE.COM NY | housing | upkeep |
MOVINGHELP.COM 800-789-3638 AZ | housing | upkeep |
COMCAST CHICAGO 800-COMCAST IL | utilities | upkeep |
7-ELEVEN 35808 AUSTIN TX | transportation | upkeep |
BYT* CMTA AUSTIN TX | transportation | upkeep |
CAP METRO 5124741200 TX | transportation | upkeep |
CAP METRO AUSTIN TX | transportation | upkeep |
CAPITAL METROPOLITAN T 512-389-7564 TX | transportation | upkeep |
CHICAGO O’HARE INTERNA CHICAGO IL | transportation | upkeep |
ENTERPRISE RENT-A-CAR AUSTIN TX | transportation | upkeep |
Finally, I use the source_to_primecat key of unique vendor names and category labels with left_join to apply labels to the full dataset.
Code
# merge category key with full dataset
expenses_labled <- CC_stmt_allmonths %>%
  left_join(source_to_primecat)
kable(head(expenses_labled, 10))
date | id | source | amount | subcategory | primecategory |
---|---|---|---|---|---|
2022-01-03 | 2.469216e+22 | COMCAST CHICAGO 800-COMCAST IL | -11.26 | utilities | upkeep |
2022-01-03 | 2.449216e+22 | EVENT* TIM AND ERIC BR VIEWSTUB.COM FL | -11.95 | entertainment | discretionary |
2022-01-03 | 2.469216e+22 | SQ *TRUENORTH CAFE HYDE PChicago IL | -5.13 | eating_out | food |
2022-01-04 | 2.449398e+22 | TRADER JOE’S #706 QPS CHICAGO IL | -30.17 | groceries | food |
2022-01-04 | 2.444500e+22 | METRA MOBILE CHICAGO IL | -4.00 | transportation | upkeep |
2022-01-04 | 2.416407e+22 | FEDEX OFFIC36000036095 CHICAGO IL | -4.54 | household | upkeep |
2022-01-04 | 2.469216e+22 | FARMER’S FRIDGE CHICAGO IL | -5.47 | eating_out | food |
2022-01-04 | 2.469216e+22 | SQ *TRUENORTH CAFE HYDE PChicago IL | -5.80 | eating_out | food |
2022-01-04 | 2.469216e+22 | AMZN Mktp US*FC9T38QM3 Amzn.com/billWA | -17.63 | amazon | other |
2022-01-05 | 2.413746e+22 | CVS/PHARMACY #05836 CHICAGO IL | -17.67 | household | upkeep |
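With every transaction labeled, the dataframe is ready for the kind of summaries that motivated this project in the first place. As a quick sketch (amounts are negative in the raw statements, so the totals come out negative as well):
Code
# total spending by primary category
expenses_labled %>%
  group_by(primecategory) %>%
  summarise(total = sum(amount)) %>%
  arrange(total)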
Reflection
Nothing is quite so edifying as building a tool to resolve a personal need. Not only is this work product a practical application, it’s also built around my own real-world data and empowers me to better track my spending. Crafting the solution gave me an opportunity to overcome an obstacle in my own coding comprehension and to identify a flaw in my problem-solving approach.
Solving the core problem of this exercise required me to expand my thinking about data structures. They aren’t just objects to be managed and manipulated; they can be used as tools to build other data structures. This project gave me a much stronger understanding of a fundamental data concept. I now feel confident that I can design a data structure to perform a task through iteration instead of relying solely on existing functions. Much like an assembly line is composed of discrete actions that result in a fully built product, a for-loop can use a designed data structure like a recipe to assemble a dataframe.
The next phase I have in mind for this project is to train a machine learning model to eliminate the manual preprocessing step. I have some exposure to machine learning, but I’ve never built an application. The real payoff of the manual preprocessing step is that I now have a fully realized training dataset. I am excited to continue learning while getting some practical experience in a critically important field. My plan is to use this textbook and/or tidymodels to educate myself. Until then, I am very happy with my progress and am looking forward to phase two.