Skip to content

eirikbrandsaas/PanelDataTools.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

134 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PanelDataTools

Build Status Coverage

Introduction

This package aims to introduce some convenience tools for working with Panel Data in the DataFrames.jl world in Julia. The package currently supports lags, leads, diffs, seasonal diffs, spell analysis, and panel structure checks.

The package is inspired by Stata's great panel data features such as tsspell and lag/lead/difference operators L., F.,S., and D.. It relies on DataFrames.jl and PanelShift.jl. The goal is to provide results that are correct and easy to obtain.

Workflow:

  1. First set the :id and :t variables using paneldf!(df::DataFrame, id::Symbol, t::Symbol)
  2. To find the lagged :x column use lag!(df,:x).
    • If you don't want to store metadata you can pass the id and t column names directly: lag!(df,:id,:t,:a))
  3. See below for examples of lags, leads, and diffs (where you can specify multiple colomns or time gaps), how to find spells (e.g., unemployment spells), how to fill time gaps, and giving names to new columns

Examples working with Dates/Time:

The default time gap ("Delta") is 1 oneunit as determined by oneunit(). For Int, Date, and DateTime this defaults to 1, 1 day, and 1 millisecond, respectively. If these are not the gaps you want, you must specify the correct gaps.

For example, here we have daily data with many gaps:

using PanelDataTools,DataFrames,Dates
df = DataFrame(id=fill(1,4),
    t=[Date(2000,1,1),Date(2000,1,2),Date(2000,2,1),Date(2001,1,1)],
    a=[0,1,1,1])
paneldf!(df,:id,:t; verbose=true) # verbose=true prints the inferred panel/time/delta

panel variable: id
 time variable: t
         delta: 1 day
      balanced: true
          gaps: true
    duplicates: false

Next, we create lags of a with various gaps

lag!(df,:a,Day(1),name="L(Day=1)")
lag!(df,:a,Month(1),name="L(Month=1)")
lag!(df,:a,Day(366),name="L(Day=366)") # 366 days = one year (2000 was a leap year)
lag!(df,:a,Month(12),name="L(Month=12)") # 12 months = one year
lag!(df,:a,Year(1),name="L(Year=1)")
lag!(df,:a) # Default (picks time gap of 1 day and names the column "L1a")
display(df)

which gives:

4×9 DataFrame
 Row │ id     t           a      L(Day=1)  L(Month=1)  L(Day=366)  L(Month=12)  L(Year=1)  L1a     
     │ Int64  Date        Int64  Int64?    Int64?      Int64?      Int64?       Int64?     Int64?  
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   11  2000-01-01      0   missing     missing     missing      missing    missing  missing 
   21  2000-01-02      1         0     missing     missing      missing    missing        0
   31  2000-02-01      1   missing           0     missing      missing    missing  missing 
   41  2001-01-01      1   missing     missing           0            0          0  missing 

Shifts: Leads and Lags

Easily create leads, lags, diffs, and seasonal diffs from panels:

using PanelDataTools, DataFrames
df = DataFrame(id = [1,1,1,2,2,2], t = [1,2,3,1,2,3], a = [0,0,1,1,1,0])
paneldf!(df,:id,:t)

lag!(df,:a)
lead!(df,:a)
lead!(df,:a,2) # last argument is how many lags
display(df)
6×6 DataFrame
 Row │ id     t      a      L1a      F1a      F2a
     │ Int64  Int64  Int64  Int64?   Int64?   Int64?
─────┼────────────────────────────────────────────────
   11      1      0  missing        0        1
   21      2      0        0        1  missing
   31      3      1        0  missing  missing
   42      1      1  missing        1        0
   52      2      1        1        0  missing
   62      3      0        1  missing  missing

or as a one-liner specifying multiple lead lags OR multiplate variables at a specific shift:

lag!(df,:id,:t,:a,[-2,-1,1]) # -2 and -1 becomes leads of a
lead!(df,:id,:t,:a,[-1,1,2]) # -1 becomes a lag
lag!(df,:id,:t,[:a,:b,:c],2) # Find lags of a,b, and c

Differences

There is also support for "seasonal" and difference operators mimicking Stata's S.x and D.x syntax:

df = DataFrame(id = [1,1,1,2,2,2], t = [1,2,3,1,2,3], a = [1,1,1,1,0,0])
paneldf!(df,:id,:t)
diff!(df,:a,1)
diff!(df,:a,2)
seasdiff!(df,:a,1)
seasdiff!(df,:a,2)
display(df)
 Row │ id     t      a      D1a      D2a      S1a      S2a
     │ Int64  Int64  Int64  Int64?   Int64?   Int64?   Int64?
─────┼─────────────────────────────────────────────────────────
   11      1      1  missing  missing  missing  missing
   21      2      1        0  missing        0  missing
   31      3      1        0        0        0        0
   42      1      1  missing  missing  missing  missing
   52      2      0       -1  missing       -1  missing
   62      3      0        0        1        0       -1

Provide Names

You can also create new variable names by adding the name="FancyName" keyword argument:

lag!(df,:a,name="FancyName")

Note that this only works operating over a single column.

Spells

or to obtain spells as in tsspell in Stata:

df = DataFrame(id = [1,1,1,2,2,2], t = [1,2,3,1,2,3], a = [0,0,1,1,1,0])
paneldf!(df,:id,:t)
spell!(df,:a)
display(df)
6×6 DataFrame
 Row │ id     t      a      _spell  _seq
     │ Int64  Int64  Int64  Int64   Int64
─────┼────────────────────────────────────
   11      1      0       1      1
   21      2      0       1      2
   31      3      1       2      1
   42      1      1       1      1
   52      2      1       1      2
   62      3      0       2      1

Filling in gaps (tsfill)

tsfill is used to fill in gaps of the time variable. You do not need to use it to construct the correct leads, lags or differences.

df = DataFrame(id = [1,1,2,2,2], t = [1,3,1,2,3], a = [1,1,1,0,0]) # Note, missing t=2 for id=1
paneldf!(df,:id,:t)
dfn = tsfill(df) # Since tsfill extends columns in the DataFrame is does not operate inplace
6×3 DataFrame
 Row │ id     t      a
     │ Int64  Int64  Int64?
─────┼───────────────────────
   11      1        1
   21      2  missing
   31      3        1
   42      1        1
   52      2        0
   62      3        0

or maybe you want to fill in with time gaps of 0.5:

dfn_half = tsfill(df,0.5)
 Row │ id     t        a       
     │ Int64  Float64  Int64?  
─────┼─────────────────────────
   11      1.0        1
   21      1.5  missing 
   31      2.0  missing 
                        
   92      2.5  missing 
  102      3.0        0

Checking panel structure (checkpanel)

checkpanel reports whether a panel is balanced (every id observed at the same time points), whether any id has gaps in its time series, and whether there are duplicate (id, t) pairs. It always prints the summary and returns a NamedTuple:

df = DataFrame(id = [1,1,2,2,2], t = [1,3,1,2,3], a = [1,1,1,0,0]) # id=1 is missing t=2
paneldf!(df,:id,:t)
checkpanel(df)
      balanced: false
          gaps: true
    duplicates: false

The individual checks are also available as predicates returning a Bool: isbalanced(df), hasgaps(df), and hasduplicates(df) (each also accepts explicit :id/:t columns). Unbalanced and gapped panels are fully supported by the rest of the package, so they are reported for information only. Duplicate (id, t) pairs, however, make lags/diffs/spells unreliable, so they always trigger a warning:

df = DataFrame(id = [1,1,2], t = [1,1,2], a = [0,1,0]) # id=1 has two t=1 rows
checkpanel(df,:id,:t)
      balanced: false
          gaps: false
    duplicates: true
┌ Warning: Panel has duplicate (id, t) pairs; lags/diffs/spells will be unreliable
└ @ PanelDataTools

Relevant links and packages

Possible future features

  • Link with GLM or FixedEffectModels so that you can specify a model with lags (Model(y ~ x + F.a))

About

Convenient functions for working with panel (longitudinal) data in Julia

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages