Thinking outside the box: Leveraging UNIX Tools (GNU) for Data Analysis and Validation
September 23, 2024: 4:00 AM - 5:00 AM
Data Collection, Management & Manipulation, Brookside B

Authors: David Horvath

Abstract: What do you do when you get a file that is too big to look at under Windows (records that are too wide, too many records, or both)? How do you know that a CSV or tab-delimited file really is what it claims to be? When you get a fixed-format file, how do you figure out what the field at position 468 really looks like (what data it contains)? I end up using tools available under UNIX/Linux/POSIX to help me look at the data and perform some validation. Those commands include head, tail, more (or less or pg), view, od, wc, dd conv (ASCII/EBCDIC conversions), and even simple AWK scripts. While these commands may look like gibberish now, they will make sense by the time this session is done. And when there are data issues, finding the specific record that caused the problem can be difficult under Windows. These tools are available for the Windows/PC environment too: the GNU-based Cygwin environment comes in very handy! And for those who prefer Mac OS X, you already have these tools!

Paper