SOUTHEAST SAS USER GROUP CONFERENCE



Education/Institutional Research

Paper Authors Title Abstract
Paper 41 Peter Zsiga Two-way Survey Analysis for Units and Organization with Means and Percentages Favorable and Unfavorable Employee turnover costs business millions of dollars annually. While customer satisfaction surveys are frequently administered and evaluated, employee satisfaction is often neglected. In fields such as medicine and education, retention of key employees such as nurses and teachers is mission critical. One Florida school district with over 40,000 students and over 5,000 employees surveys teachers twice yearly and uses Enterprise Guide® software to produce detailed analysis of 33 questions for over 40 different worksites. Separate surveys for 24 departments are also distributed twice a year to school leadership and the returns analyzed to evaluate organization-wide services and support. Although the techniques are the same, only the larger teacher survey is discussed in this article. Basic to intermediate level SAS users should find something new and applicable to any business seeking to improve employee satisfaction.
Paper 65 Kelly Smith Moving beyond Frequency and Percentage to Chi Square, t-Tests, and Correlation Analysis. When institutional research offices are overwhelmed with data requests, outgoing reports may be limited to frequencies and percentages. While an important first step, additional analysis and a deeper understanding of the data are just a few steps away. In this presentation, a typical data request for frequencies and percentages will be taken to the next level through the use of PROC FREQ, PROC CORR, and PROC TTEST. Data sets obtained from the UCI Machine Learning Repository are used for analysis. The UCI Student Performance data sets represent actual academic and demographic data of students from two Portuguese secondary schools. Participants will receive SAS code for analysis and visualization.
Paper 81 Maham Khan, James Farley, Zhong Zheng and Glendalis Gonzales Understanding the Effects of Campus Safety on College Student Retention and Completion: A Panel Data Analysis * College student retention is influenced by social forces other than students' own academic performance. Previous research has revealed that students do not prioritize college crime in the college selection process, compared to their parents (Warwick & Mansfield, 2006). However, it is unknown whether campus safety would have an impact on students' decision to stay at the same college after the first year, transfer to another college, as well as the ability to graduate from the college. Using joined panel data from College Scorecard & Campus Safety and Security Data Analysis Cutting Tool, we analyze the predictors of retention and completion at R1 universities in the United States. The data was curated and analyzed using different statistical procedures in SAS® Enterprise Guide® and SAS® Viya®. We find campus crime is not associated with student retention, transfer, or completion. Private universities and higher SAT scores are related to higher first-year retention rates. Higher SAT scores, a higher percent of Pell grant awardees, and lower affordability of the universities are associated with increased 4-year completion rates. Lower SAT scores and a lower proportion of full-time faculty correlate with greater 6-year transfer rates. Higher SAT scores, a higher percent of Pell grant awardees, lower affordability, and a higher percent of female students are connected with increased 6-year completion rates. Results indicate that universities attracting high achieving students have higher retention and graduation rates and lower transfer rates. Expensive colleges that are predominantly private and elite report greater graduation rates and smaller transfer rates. Our analysis shows that campus crime is not a deciding factor for continuing education. Students might give quality of education, the profile of the faculty, and other opportunities more importance while choosing their higher education pathways.



Healthcare/Pharmaceuticals

Paper Authors Title Abstract
Paper 24 Hengwei Liu Different Ways to Create Patient Profiles In patient profiles, the demographics and baseline characteristics and the safety information, such as adverse events, concomitant medications, and laboratory results, are all closely and neatly packed into a few pages. This differs from the after-text tables we create with SAS® for clinical study reports, where only one table is displayed on each page. ODS LAYOUT in SAS, R Shiny, and R Markdown can be used to stack multiple tables on top of each other. A DATA _NULL_ step with PUT statements in SAS can be used to output an HTML file for this purpose as well. In this paper we discuss these different methods for generating patient profiles.
Paper 68 Tamar Roomian Using PROC SQL to restructure data for common healthcare applications Electronic medical record data is commonly stored in "long" format (multiple rows per id, where one column stores many pieces of information on the same id). However, for many applications, data must be transposed to "wide" format (one row per id with multiple columns representing each piece of information needed). This paper will present how to use PROC SQL to transpose data from long to wide using a common example from healthcare utilizing electronic medical record data. This paper will discuss the advantages and disadvantages of using PROC SQL over PROC TRANSPOSE. In addition, it is commonly required to join two sources of data based on logic that depends on dates. Example questions could include: Which patients received a depression diagnosis 4 months prior to their emergency visit? Which patients had an encounter for a fall within 2 years from their encounter for a fracture? The DATA step MERGE statement requires equal conditions using the BY statement. PROC SQL does not require equal conditions and therefore can be used to join data with unequal date conditions. Unequal join conditions using dates will be shown continuing the healthcare example. This paper will present how to use PROC SQL first to determine encounters that happened before or after an index date, and second, to determine encounters that happened within a specified time frame from an index date. The INTNX function will also be briefly introduced. This presentation is aimed at beginner to intermediate SAS programmers who already have a basic knowledge of PROC SQL and are looking to expand their SQL abilities.
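The two SQL patterns named in the Paper 68 abstract can be sketched roughly as follows. This is only an illustration, not the paper's code: it uses Python's standard-library sqlite3 module instead of PROC SQL, so the date arithmetic is SQLite's `date()` modifier rather than INTNX. The tables `labs`, `dx`, and `ed`, their columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_demo():
    """Create a small in-memory database with hypothetical sample data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE labs (id INTEGER, test TEXT, value REAL);   -- "long" format
        INSERT INTO labs VALUES (1, 'a1c', 6.1), (1, 'ldl', 130.0), (2, 'a1c', 7.4);
        CREATE TABLE dx (id INTEGER, dx_date TEXT);               -- diagnosis dates
        CREATE TABLE ed (id INTEGER, ed_date TEXT);               -- emergency visits
        INSERT INTO dx VALUES (1, '2021-01-10'), (2, '2020-01-01');
        INSERT INTO ed VALUES (1, '2021-03-01'), (2, '2021-03-01');
    """)
    return con


def long_to_wide(con):
    """Long to wide: one row per id, one CASE expression per output column."""
    return con.execute("""
        SELECT id,
               MAX(CASE WHEN test = 'a1c' THEN value END) AS a1c,
               MAX(CASE WHEN test = 'ldl' THEN value END) AS ldl
        FROM labs
        GROUP BY id
        ORDER BY id
    """).fetchall()


def dx_before_visit(con, months=4):
    """Unequal join: diagnosis dated within `months` months before the visit."""
    return con.execute("""
        SELECT ed.id, dx.dx_date, ed.ed_date
        FROM ed
        JOIN dx
          ON dx.id = ed.id
         AND dx.dx_date <  ed.ed_date                 -- before the index date
         AND dx.dx_date >= date(ed.ed_date, ?)        -- but inside the window
        ORDER BY ed.id
    """, (f"-{months} months",)).fetchall()


if __name__ == "__main__":
    con = build_demo()
    print(long_to_wide(con))
    print(dx_before_visit(con))
```

The join condition mixes an equality on `id` with inequalities on the dates, which is the kind of condition a BY-based merge cannot express directly. ISO-formatted date strings are assumed so that plain string comparison orders dates correctly.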
Paper 71 Hong Zhang and Huei-Ling Chen Page Margin Checking Macro for RTF files The Rich Text Format (RTF) file is the most popular output file format for clinical study reports. Different companies may have different standards for the format of Tables, Listings, and Figures (TLF) in the pharmaceutical industry. One such standard would be a general table format (e.g., title, footnote, table width, text font, border, or borderless). Other standards include page formats, for instance, page orientation and page margins. It is crucial to set up margin standards on every side of the page to avoid obscuring information when printed and bound. Thus, a programmer needs to ensure page orientation and margins of all RTF files in a deliverable package meet the criteria. Manually checking each file is not very practical, so we derived a macro for the task. This macro is easy to use to check RTF files from one or multiple folders. This paper will demonstrate the technique utilized in this macro and introduce the key RTF syntax related to page orientation and page margins.
Paper 86 Jayanth Iyengar NHANES Dietary Supplement component: a parallel programming project The National Health and Nutrition Examination Survey (NHANES) contains many sections and components which report on and assess the nation's health status. A team of IT specialists and computer systems analysts handle data processing, quality control, and quality assurance for the survey. The most complex section of NHANES is dietary supplements, from which five publicly released data sets are derived. Because of its complexity, the Dietary Supplements section is assigned to two SAS programmers who are responsible for completing the project independently. This paper reviews the process for producing the Dietary Supplements section of NHANES, a parallel programming project, conducted by the National Center for Health Statistics, a center of the Centers for Disease Control and Prevention (CDC).



Know Your SAS: Advanced Techniques

Paper Authors Title Abstract
Paper 7 Bruce Gilsen Copying Data Between SAS® and JSON Files JavaScript Object Notation (JSON) is an open standard file format and data interchange format used for some of the same purposes as XML. More information about JSON is readily available on the internet. Starting in SAS 9.4, you can copy SAS data sets to JSON files with PROC JSON. Starting in SAS 9.4TS1M4, you can copy JSON files to SAS data sets with the JSON engine. Copying data from SAS to JSON with PROC JSON is relatively straightforward. Copying data from JSON to SAS can be much more complicated in some cases. To the extent possible, the examples in this paper copy JSON files to SAS in an automated way. Determining how to copy additional types of JSON files into SAS in an automated way is an area of ongoing research. Reading JSONL files into SAS and the JSONPP DATA step function, which converts a single record JSON file to a "pretty" JSON file, are also discussed. This paper provides basic information and some examples that use small amounts of data.
Paper 8 Ronald Fehd Using LaTeX document class sugconf to write your paper The international SAS software conference, SAS Global Forum (SGF), now accepts papers written with LaTeX typesetting software. This paper illustrates use of the document class sugconf and provides a basic paper template and references to advanced usage of LaTeX. Audience: SAS user group authors, particularly those using Jupyter Notebook.
Paper 33 Kirk Paul Lafler Under the Hood: The Mechanics of SQL Query Optimization Techniques The SAS® software and SQL procedure provide powerful features and options for users to gain a better understanding of what's taking place during query processing. This presentation explores the fully supported SAS® MSGLEVEL=I system option and PROC SQL _METHOD option to display valuable informational messages on the SAS® Log about the SQL optimizer's execution plan as it relates to processing SQL queries; along with an assortment of query optimization techniques.
Paper 42 Thomas Billings SAS® Macro to Identify Potentially Obsolete Variables in a File The problem: you have SAS datasets and/or database tables, and want to identify variables that might be obsolete. A SAS macro is presented that examines all the variables in each row of a file and determines the earliest and latest as-of dates/datetimes when each variable is populated. The macro requires that the file has an as-of date/datetime variable in each row, and that variable cannot be missing/null. The macro creates a SAS dataset with the derived by-variable metadata, and those metadata are easily tested to identify variables that may be obsolete/deprecated.
Paper 49 David Horvath X Command Tips and Tricks under UNIX/LINUX SAS provides the ability to execute operating system level commands from within your SAS code - generically known as the "X Command". This session explores the various commands, the advantages and disadvantages of each, and their alternatives. The focus is on UNIX/Linux but much of the same applies to Windows as well. Under SAS EG, any issued commands execute on the SAS engine, not necessarily on the PC.
  • X
  • %sysexec
  • Call system
  • Systask command
  • Filename pipe
  • &SYSRC
  • Waitfor

Alternatives will also be addressed - how to handle when NOXCMD is the default for your installation, saving results, and error checking.
Paper 56 Kiran Venna Finding variable names and count of variables with missing values at various positions within an observation using SAS arrays. SAS arrays make it easy to find variable names or count of variables with missing values at various positions within an observation. In this paper, we will discuss four different scenarios of finding variable names/counts of missing values. The first scenario is finding all variable names which have missing values within an observation. The second one is finding the variable name of the first occurrence or any other specific occurrence of missing value within an observation. The third one is finding the total count of consecutive missing values only from the beginning and end of observation. Finally, the last one is finding the count of missing variables that are present in observation after at least one non-missing value. All these scenarios will be discussed with relevant examples along with sample data and code.
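The four scenarios listed in the Paper 56 abstract can be sketched as below. This is a plain-Python stand-in, not the paper's SAS array code: one observation is modeled as a dict whose keys are variable names in column order and whose missing values are `None`. The function names, the variable names `q1` to `q6`, and the choice to report leading and trailing counts separately are all assumptions made for illustration.

```python
def missing_names(row):
    """Scenario 1: names of all variables that are missing in the observation."""
    return [name for name, value in row.items() if value is None]


def nth_missing_name(row, n=1):
    """Scenario 2: name of the n-th missing variable (1-based), or None if absent."""
    names = missing_names(row)
    return names[n - 1] if len(names) >= n else None


def leading_trailing_missing(row):
    """Scenario 3: counts of consecutive missing values at the start and at the end.

    For an all-missing observation both counts equal the number of variables.
    """
    values = list(row.values())

    def run_length(seq):
        count = 0
        for value in seq:
            if value is not None:
                break
            count += 1
        return count

    return run_length(values), run_length(reversed(values))


def missing_after_first_value(row):
    """Scenario 4: missing values occurring after at least one non-missing value."""
    seen_value = False
    count = 0
    for value in row.values():
        if value is not None:
            seen_value = True
        elif seen_value:
            count += 1
    return count


if __name__ == "__main__":
    obs = {"q1": None, "q2": None, "q3": 5, "q4": None, "q5": 7, "q6": None}
    print(missing_names(obs))
    print(nth_missing_name(obs, 3))
    print(leading_trailing_missing(obs))
    print(missing_after_first_value(obs))
```

Each function makes a single left-to-right pass over the values, which mirrors how a DO loop over a SAS array would walk the variables of one observation.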
Paper 69 William Smith Utilizing SAS Macros to Deduplicate Your Data Data deduplication is an imperative step for researchers when producing high-quality data products. If survey respondents do not have a unique identifier, data are particularly prone to data duplication since participants could have completed a survey multiple times. To produce quality data, researchers must identify and remove duplicate records. This paper explains the use of a set of SAS macros that create matching scores between every possible combinational pair of respondents in a dataset. These macros allow users to input specific datasets, variables, and score criteria that will be used to identify potential duplicates. The output of the macro will provide the user with a list of potential matches based on their specifications, sorted by match probability to provide researchers with a quick and efficient way to deduplicate their datasets.
Paper 80 Zach Acuff Coarsening Continuous Variables: An Automated Method to Create Categorical Versions of Continuous Variables While Mitigating Loss of Precision Researchers often want to create a categorical version of a continuous variable, perhaps for analytical purposes or to make potentially identifying information more secure for public consumption. Although there are already several ways to categorize continuous data, going from a continuous to a categorical variable necessarily involves some loss of precision that may not be desirable. This paper describes a SAS® macro which presents a novel method for automatically categorizing continuous data with minimal user input. This method creates a categorical version of a continuous variable such that 1) each categorical value occurs at least as often as the user desires, and 2) the total difference between the actual continuous values and their categorical counterparts is minimized to the extent the macro is able.
Paper 87 Jayanth Iyengar FROM %LET TO %LOCAL; METHODS, USE, AND SCOPE OF MACRO VARIABLES IN SAS PROGRAMMING Macro variables are one of the powerful capabilities of the SAS system. Utilizing them makes your SAS code more dynamic. There are multiple ways to define and reference macro variables in your SAS code; from %LET and CALL SYMPUT to PROC SQL INTO. There are also several kinds of macro variables, in terms of scope and other ways. Not every SAS programmer is knowledgeable about the nuances of macro variables. In this paper, I explore the methods for defining and using macro variables. I also discuss the nuances of macro variable scope, and the kinds of macro variables from user-defined to automatic.
Paper 93 Troy Hughes Yo Mama is Broke Cause Yo Daddy is Missing: Autonomously and Responsibly Responding to Missing or Invalid SAS® Data Sets Through Exception Handling Routines Exception handling routines describe processes that can autonomously, proactively, and consistently identify and respond to threats to software reliability, by dynamically shifting process flow and by often notifying stakeholders of perceived threats or failures. Especially where software (including its resultant data products) supports critical infrastructure, has downstream processes, supports dependent users, or must otherwise be robust to failure, comprehensive exception handling can greatly improve software quality and performance. This text introduces Base SAS® defensive programming techniques that identify when data sets are missing, exclusively locked, or inadequately populated. The use of user-defined return codes as well as the &SYSCC (system current condition) automatic macro variable is demonstrated, facilitating the programmatic identification of warnings and runtime errors. This best practice eliminates the necessity for SAS practitioners to routinely and repeatedly check the SAS log to evaluate software runtime or completion status. Finally, this text demonstrates wrapping exception handling routines within modular, reusable code blocks that improve both software quality and functionality.
Paper 99 Josh Horstman Using the Output Delivery System to Create and Customize Excel Workbooks In years past, SAS output was limited to the text-based SAS listing. However, the Output Delivery System (ODS) greatly enhanced the capabilities of the SAS system by allowing users to create highly-customizable output in a variety of document formats, including Microsoft Excel workbooks. This paper provides a brief overview of how to use the ODS EXCEL destination to create Excel workbooks and how to customize the various visual attributes of the output such as fonts, colors, styles, and much more.



Know Your SAS: Foundations

Paper Authors Title Abstract
Paper 11 Melvin Alexander Using JMP® and R Integration to Analyze Virtual Chat Messages during a Coronavirus Pandemic The Coronavirus pandemic has altered the way communications, in-person meetings, and social gatherings take place. Video conferencing technologies such as Zoom, Skype, and Microsoft® Teams have been the predominant ways of conducting and attending meetings, social networks, and conferences. Sentiment Analysis is a new add-in feature which came in version 16 of JMP Pro's Text Explorer platform. With JMP® and R Integration, these same capabilities are also available to base JMP users. This paper presents the use of natural language processing techniques of comparison word clouds and sentiment analysis of Zoom chat messages during a video conferencing session that helped uncover the feelings, opinions, and attitudes of SAS User Group participants about professional development.
Paper 12 Stephen Sloan Twenty Ways to Run your SAS programs faster and use less space When we run SAS® programs that use large amounts of data or have complicated algorithms, we often are frustrated by the amount of time it takes for the programs to run and by the large amount of space required for the program to run to completion. Even experienced SAS programmers sometimes run into this situation, perhaps through the need to produce results quickly, through a change in the data source, through inheriting someone else's programs, or for some other reason. This paper outlines twenty techniques that can reduce the time and space required for a program without requiring an extended period of time for the modifications. The twenty techniques are a mixture of space-saving and time-saving techniques, and many are a combination of the two approaches. They do not require advanced knowledge of SAS, only a reasonable familiarity with Base SAS® and a willingness to delve into the details of the programs. By applying some or all of these techniques, people can gain significant reductions in the space used by their programs and the time it takes them to run. The two concerns are often linked, as programs that require large amounts of space often require more paging to use the available space, and that increases the run time for these programs.
Paper 26 Aaron Brown A SAS® Toolbox for File and Folder Manipulation: Copy, Rename, Delete, or Zip via Functions or X Commands This paper discusses various utilities within SAS® for manipulating folders and files within a Windows environment, including creating folders, copying files, renaming files, deleting files and folders, and zipping folders. It includes examples of using tools like X commands, the DLCREATEDIR option, macros to zip/delete folders, and functions like FDELETE and FCOPY. First we discuss several tools, then show a small project that utilizes them.
Paper 37 Richann Watson and Louise Hadden What Kind of WHICH Do You CHOOSE to be? A typical task for a SAS® practitioner is the creation of a new variable that is based on the value of another variable or string. This task is frequently accomplished by the use of IF-THEN-ELSE statements. However, manually typing a series of IF-THEN-ELSE statements can be time-consuming and tedious, as well as prone to typos or cut and paste errors. Serendipitously, SAS has provided us with an easier way to assign values to a new variable. The WHICH and CHOOSE functions provide a convenient and efficient method for data-driven variable creation.
Paper 46 David Horvath NOBS for Noobs This mini-session will be a short discussion of the NOBS (number of observations) option on the SET statement. This includes one "gotcha" that I've run into with WHERE clauses: NOBS is set before WHERE processing. If you have a reason to know the number of observations after the WHERE clause, another DATA step is needed.
Paper 57 Michael Raithel Using the SAS® HPBIN Procedure to Create Format Value Ranges for Numeric Variables How would you go about determining the format value ranges for a new continuous numeric variable? You could run PROC MEANS to get the min, max, median, and quantiles; and then construct it from those metrics. That works, but it constrains you to having only four value ranges for your format. Also, the MEANS Procedure output is not in a structure conducive to creating the actual SAS Format Start/End/Label statements. It would be advantageous to have a methodology whereby a programmer could choose the desired number of value ranges and generate output close to what is needed for the PROC FORMAT VALUE statements. One such method is to employ a binning technique available through SAS’s High Performance Bin procedure. PROC HPBIN can be used to provide mathematically sound, defensible methodologies for creating the value ranges of numeric variables. Programmers can specify the number of bins (rows) they desire and PROC HPBIN computes the numerical boundaries that can be used to define the Start/End values in PROC FORMAT statements. This paper introduces the Continuous Variable Format Start and End Values Creator program, which contains a macro for creating suggested SAS format Start/End values. Users can specify either of two main binning techniques and the macro produces a spreadsheet of computed value ranges. SAS programmers can copy this macro and begin using it right away to define their own numeric formats.
Paper 72 David Bosak The reporter package: A powerful and easy-to-use reporting package for R SAS® programmers who come to R are often disappointed by the reporting options available in R. Creating a report that takes a few minutes in SAS® can take hours in R. Sometimes it appears impossible to create an equivalent report at all. The reporter package was built to overcome the difficulty of reporting in R. This package contains functions to create regulatory-style statistical reports. Originally designed to generate tables, listings, and figures (TLFs) for the pharmaceutical, biotechnology, and medical device industries, these reports are generalized enough that they could be used in any industry. The reporter package can output text, rich-text, and PDF file formats. The package specializes in printing wide and long tables with automatic page wrapping and splitting. Reports can be produced with a minimum of function calls, and without relying on other table packages. The package supports titles, footnotes, page headers, page footers, spanning headers, page by variables, and automatic page numbering. This paper will provide a brief overview of the reporter package. The reader should have some familiarity with the R language and RStudio®.
Paper 73 Deb Cassidy Where Where is a Problem A statistician was reviewing a table and had different results than the programmers. Investigation showed the issue was really with the statistician's use of multiple WHERE's. This paper will show several ways of having multiple WHERE and IF statements. It will show which ones work as expected and which don't. This is an entry level presentation. However, even long-time SAS programmers may learn something they never thought about. The author definitely did. The code in the paper was tested using Enterprise Guide.
Paper 75 Thomas Billings Tips for Input/Handling of Dates in Command Files and Production Production programs often have complex data handling and this makes it difficult to run a process (often outside production) for dates other than the current period. Here we illustrate select methods to make date handling more flexible. Many production programs calculate the target interval starting with the TODAY() function for the current date. We show a very simple code change that supports the option to run for alternate start dates. Checking for the last workday of the month can be messy as holidays may occur; we show a simple method that avoids this issue. Dates may be input using PROC IMPORT from an Excel command file and the dates may come in as character variables instead of dates. We discuss ways to avoid/handle this. Date parameters may be passed via the operating system SAS command invocation. We show easy ways to parse the values and test/set the values using SYSPARM-related features.
Paper 82 Julia Skinner May i? Lessons learned from using nested DO loops in a family card game SAS is used to solve a wide variety of complex problems. However, it can also be used on a smaller scale, and these projects often provide ample opportunities for learning the lessons needed to tackle these larger projects. This paper describes the author's use of SAS to solve a small problem born of family bickering: the wager required per person for a card game. Using nested DO loops, this paper demonstrates how to determine all possible combinations of wagers that satisfy the stakeholders. Despite the small scope and light-hearted nature of the exercise, the process offers several lessons that can be applied to solving more complicated real-world problems.
Paper 83 Rachel Straney and Lesa Caves PROC FORMAT for Scrappy SAS Users and Posh Programmers Whether you are just starting out with SAS® or have been using it for years - PROC FORMAT is one of the most important procedures that you can use. PROC FORMAT is a deceptively powerful procedure that can be used to label and stylize data values in various ways. It also has the ability to be leveraged for data creation. People may use this procedure for some of its fundamental uses, but there are many tools in the format procedure that can optimize your SAS program and save time. This paper will cover an introduction to the procedure, share ways that the procedure can be integrated with data step processing, and additional helpful tricks to make your programming easier.
Paper 84 Jonathan Duggins and Jim Blum PROC REPORT: Tips and Customizations for Quickly Creating Customized Reports Producing high-quality reports is a cornerstone of data science, statistics, and statistical programming careers. While PROC REPORT has been around since SAS 6, there is enough variety in what type of reports it can produce that students and practitioners alike are often unaware of some of its intricacies. This paper begins by reviewing the more similar usages (ORDER and GROUP) and some good practices for their use before going on to explore various applications of COMPUTE blocks. COMPUTE blocks provide multiple ways to customize the aesthetics of a report and so this paper concludes with a look at adjusting column, row, and cell styles based on columns that may or may not appear in the report. Each topic will be demonstrated via examples that show the code and results. Commented code, data sets, and results will be provided as downloadable resources. Attendees should have a basic familiarity with the REPORT Procedure and some experience with conditional logic to get the most out of this presentation.
Paper 85 Ronald Fehd A Configuration File Companion: using environment variables and options The startup process of SAS software reads one or more configuration files, *.cfg, which have allocations of environment variables, the values of which are used in SAS startup-only options to provide access to libraries, sets of folders that contain files that SAS uses for functions, macros, and procedures. This paper provides programmers and advanced users programs to review the default configuration files; procedures, options, and sql to discover options; and a suite of programs to use in Test-Driven Development (TDD) to trace and verify user-written configuration files.
Paper 96 Kent Phelps and Ronda Phelps Base SAS® & SAS® Enterprise Guide®: Automate Your SAS® World with Dynamic Code Communication is the foundation of all relationships, including our SAS relationship with the Server, PC, or Mainframe. To communicate more efficiently ~ and to increasingly automate your SAS World ~ you will want to learn how to transform static code into dynamic code that automatically re-creates the static code, and then executes the re-created static code automatically. Our presentation highlights the powerful partnership that occurs when dynamic code is creatively combined with a dynamic FILENAME statement, the SET INDSNAME option, a Macro variable, and the CALL EXECUTE command within one SAS Enterprise Guide Base SAS program node. You have the exciting opportunity to learn how to design dynamic code forward and backward to re-create static code while automatically changing the year. You will see how 1,784 time-consuming manual steps are amazingly replaced with only 1 time-saving dynamic automated step! This presentation details the UNIX and Microsoft Windows syntax for our project example and introduces you to your newest BFF (Best Friend Forever) in SAS.
Paper 98 Josh Horstman Getting Started with Data Step Hash Objects The hash object provides a powerful and efficient way to store and retrieve data from memory within the context of a DATA step. This presentation will introduce the hash object, cover its basic syntax and usage, and walk through several examples that demonstrate how it can offer new and innovative solutions to complex coding problems. This presentation is intended for SAS users who are already proficient with basic DATA step programming.



Leadership/Team Building/Career Development

Paper Authors Title Abstract
Paper 30 Kirk Paul Lafler Differentiate Yourself Today's job, employment, contracting, and consulting marketplace is highly competitive. As a result, SAS® professionals should do everything they can to differentiate and prepare themselves for the global marketplace by acquiring and enhancing their technical and soft skills. Topics include describing how SAS professionals should assess and enhance their existing skills using an assortment of valuable, and "free", SAS-related content; become involved, volunteer, publish, and speak at in-house, local, regional and international SAS user group meetings and conferences; and publish blog posts, videos, articles, and PDF "white" papers to share knowledge and differentiate themselves from the competition.
Paper 35 Barbara Okerson Asking the Right Questions: Designing Surveys to Produce Valid and Reliable Results Writing questions that produce accurate, reliable and valid assessments of conditions and opinions is critical for any survey. This is not easy. Not only are both the wording and structure of questions important, but also any subtle relationships between questions that could impact how the respondent feels about these questions. Additionally, each of the questions must produce discriminating answers, be unbiased and provide information that serves the goal of the survey. Ideally questions should be pretested to assess reliability and validity but many times today, surveyors do not have either the time or the budget for rigid pre-assessment. This paper provides a checklist for writing survey questions on the fly that can produce the needed results.
Paper 50 Kirk Paul Lafler Exploring the Skills Needed by the Data Scientist As 2.5 quintillion bytes (1 with 18 zeros) of new data are created each and every day, the age of big data has taken on new meaning. More and more organizations across industries are embracing Data Science / Computer Research Scientist skills resulting in an emerging demand for qualified and experienced talent. According to the Bureau of Labor Statistics (BLS) the number of data science jobs is expected to grow 19 percent over the next two decades - nearly three times as fast as the average growth rate for all jobs. Energized by this employment outlook, students and professionals across job functions are preparing for tomorrow's growing data science / analytic demands by acquiring a comprehensive skill set. To prepare for this growing demand, many colleges, junior colleges, Universities, and vocational training organizations offer comprehensive degrees and certificate programs to fulfill the increasing demand for analytical skills. This paper and presentation explores the skills needed by the Data Scientist / Analytics professional including non-technical skills such as critical thinking; business acumen and verbal/written communications; and technical skills such as data access; data wrangling; statistics; use of statistical programming languages like Python, R and SAS®; Structured Query Language (SQL); Microsoft Excel; and data visualization.
Paper 63 Brian Varney What Level am I? A Look at Categorizing a Programmer as a Beginner, Intermediate, or Advanced There are numerous times that one gets categorized by their experience level with SAS and/or as a programmer in general. Whether it be a company hiring a programmer, a programmer determining if a presentation is appropriate for them, or a project manager building a team, it is valuable to be able to define some guidelines as to whether someone is a beginner, intermediate, or advanced programmer. This paper intends to help this process be less subjective and error prone.
Paper 64 Kelly Smith Successful Communication with Data Phobic Audiences Hard as it is for SAS folk to believe, not everyone loves data. At the same time, "data driven" and "data informed" decision making are now commonly cited as a preferred method for business organizations. Pick up tips and tricks for successfully communicating data in ways that are accessible and relevant for audience members who fear data. Explore how turning numbers into images helps make connections.
Paper 79 Kelly Smith Developing Ethical Data Use and Users Just because we can, should we? Has the ability to analyze data outpaced the growth of data ethics? After a short review of ethics and the current state of data ethics, join a discussion of where we are, where we want to be, and how to get us there. Should data ethics training be required? What are the options to promote ethical data use and deter poor ethical choices?
Paper 94 Troy Hughes Badge in Batch with Honeybadger: Generating Conference Badges with Quick Response (QR) Codes Containing Virtual Contact Cards (vCards) for Automatic Smart Phone Contact List Upload Quick Response (QR) codes are widely used to encode information such as uniform resource locators (URLs) for websites, flight passenger data on airline tickets, attendee information on concert tickets, or product information on product packaging. The proliferation of QR codes is due in part to the broad dissemination of smart phones and the accessibility of free smart phone applications that scan QR codes. With the ease of QR code scanning has come another common QR code usage: the identification of conference attendees. Conference badges, emblazoned with attendee-specific QR codes, can communicate attendee contact information to other conference goers, including organizers, vendors, potential customers or employers, and others. Conference badges that contain QR codes make it easy for attendees to link up with each other because snapping a photo of a badge can immediately capture contact information (that could not otherwise be printed on the badge itself). To that end, this text introduces flexible Base SAS® software that dynamically creates attendee QR codes from a data set containing contact and other information. This data-driven approach could be used to create attendee badges by conference organizers rather than costly third-party vendors. When a badge QR code is scanned by a conference goer, the attendee's personal information (including name, job title, company, phone number, email address, city, state, website, and biographical statement) is ported into a virtual contact file (VCF, or vCard) that can be uploaded automatically into a smart phone's contact list. Attendees are able to select what personal information is contained within their QR code and conference organizers are able to customize and configure badge format and content through an external cascading style sheets (CSS) file that dynamically alters badges without the necessity to modify the underlying code. This end-to-end system offers conference organizers potential cost savings of thousands of dollars, money that can be diverted from costly, third-party badge vendors to open bars and other necessities.
Paper 100 Josh Horstman So You Want To Be An Independent Consultant: 2021 Edition While many statisticians and programmers are content with a traditional employment setting, others yearn for the freedom and flexibility that come with being an independent consultant. While this can be a tremendous benefit, there are many details to consider. This paper will provide an overview of consulting as a statistician or programmer. We'll discuss the advantages and disadvantages of consulting, getting started, finding work, operating your business, and various legal, financial, and logistical issues. The paper has been recently updated to reflect the new realities of independent consulting in 2021 and beyond.



Planning and Administration

Paper Authors Title Abstract
Paper 27 Denise Kruse SAS® This Week’s Forecast: Read From The Cloud How to collect data read from RDBMS Many data centers are considering and planning for moves to cloud storage at some point in the future. Cloud providers charge by the number of records read from the cloud, which is much different than previous sizing exercises. This paper will detail how I approached the task of collecting records read from Oracle and Netezza from the SAS environment. Planning for cloud storage integration or cloud processing with SAS is a multi-faceted process. There are different configurations, including hybrids of cloud and on premises, to consider. I found that I could not even begin to address what we need in the future if I did not have granular statistics of what goes on in the current SAS environment. Cloud providers charge a price by records read from the cloud. I needed to find out what kind of price tag that would look like in my current environment, which brought me to the question “How do I track when users connect from the SAS environment into RDBMS and read records?” I was pleased to realize that I had all I needed to collect this information with SAS options. The SAS environment discussed in this paper is a 3-tier (Metadata, Compute, MidTier) setup on Linux. An estimated 95% of jobs were captured using this process. Batch jobs located in user locations amid the server were omitted. Evaluating the current SAS environment: 1. How do users connect to the SAS environment? 2. Are all user logs captured on the SAS server? 3. Are production run SAS logs captured? 4. How are connections made to RDBMS? 5. What type of data is available to aggregate counts into different categories? 6. How do I need to present the results to technology partners within the organization? How do users connect to the SAS environment? SAS Enterprise Guide: WorkspaceServer; PC SAS: Foundation or ConnectServer; Scheduled SAS: BatchServer. How are connections made to RDBMS? 1. Pre-assigned libnames via SAS Management Console (Oracle only) 2. Hardcoded libname statements (Oracle and Netezza) 3. Pass Thru design (Oracle and Netezza)
Paper 34 Kirk Paul Lafler SAS® Performance Tuning Techniques The Base-SAS® software provides users with many powerful techniques for accessing, manipulating, analyzing, and processing data and results. With the availability of so many language features and the size of data sources, application developers, programmers and end-users can benefit from a set of guidelines for efficient use of the SAS software. Topics include a number of performance tuning techniques that can be applied to code and applications to conserve CPU, I/O, data storage, and memory resources while performing tasks more efficiently when sorting, grouping, merging (or joining), summarizing, transforming, and processing data.
Paper 43 David Horvath To COMPRESS or Not, to COMPRESS or ZIP This session reviews the processing tradeoffs between uncompressed and SAS-compressed datasets as well as dealing with operating system compressed files and datasets. Is it better to process an uncompressed dataset or use SAS compression? What are the factors that influence the decision to compress (or not)? What are the considerations around applying operating system based compression (for example, Winzip or UNIX zip or GNU gzip) to regular files and SAS datasets? What are the tradeoffs? How can files in those formats be best processed in SAS?
Paper 60 Louise Hadden Management of Metadata and Documentation When Your Data Base Structure is Fluid: What to do if Your Data Dictionary has a Varying Number of Variables A data dictionary for a file based on Electronic Medical Records (EMR) contains variables which represent an unknown number of COVID-19 tests for an unknown number of infants - there is no way to know in advance how many iterations of the COVID test variable will exist in the actual data file from medical entities. In addition, variables in this file may exist for three different groups (pregnant women, postpartum women, and infants), with PR, PP and IN prefixes, respectively. This presentation demonstrates how to process such variables in a data dictionary to drive label (and value label) description creation for iterated (and other) labels using SAS functions, as well as other utilities.
Paper 95 Troy Hughes GIS Challenges of Cataloging Catastrophes: Serving up Geowaffles with a Side of Hash Tables to Conquer Big Data Point-in-Polygon Determination and Supplant SAS® PROC GINSIDE The GINSIDE procedure represents the SAS® solution for point-in-polygon determination, which answers the question of whether a geospatial reference point occurs inside or outside of a bounded region. This evaluation requires three parameters: a map data set representing the polygon (typically operationalized as a shapefile), an input data set containing one or more geospatial points, and a list of ID fields (i.e., attributes) that are conferred to all points falling inside the polygon. Thus, when lightning strikes or a tremor shatters the silence, its latitude and longitude are evaluated, and point-in-polygon evaluation confers in what city, county, state, or other jurisdictional district this occurred. Within the SAS application, the most significant factor predicting longer GINSIDE runtime (or outright GINSIDE failure) is file size of the input data set. When big data (having either a large number of fields or observations) are encountered, the GINSIDE procedure often operates inefficiently or fails with a runtime error. To improve point-in-polygon determination for big data, this paper demonstrates a novel approach in which a rectangular grid overlays the polygon of the bounded region, which produces geowaffles, rectangles that can be wholly inside the bounded region, wholly outside the bounded region, or straddling one or more boundaries. Geowaffles that are determined to be either wholly inside or wholly outside the polygon are maintained in a hash object, facilitating an efficient, in-memory point-in-polygon determination for all points not lying near a boundary, without the need to execute GINSIDE. Only the few remaining points that are proximate to polygon borders must be interpreted through the GINSIDE procedure, which facilitates runtimes that are more than ten times faster than the out-of-the-box SAS functionality.
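The geowaffle idea in Paper 95 can be illustrated outside SAS with a minimal Python sketch. This is not the author's code: a plain dictionary stands in for the hash object, a ray-casting function stands in for PROC GINSIDE, and the polygon, grid origin, and cell size are hypothetical. A cell is treated as "straddling" whenever the bounding box of any polygon edge overlaps it, which is a conservative simplification chosen for brevity.

```python
from math import floor

def point_in_polygon(x, y, poly):
    """Exact ray-casting test (stand-in for PROC GINSIDE); poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def build_waffles(poly, x0, y0, size, nx, ny):
    """Map (col, row) -> True/False for cells wholly inside/outside; straddling cells are omitted."""
    waffles = {}
    n = len(poly)
    for i in range(nx):
        for j in range(ny):
            cx0, cy0 = x0 + i * size, y0 + j * size
            cx1, cy1 = cx0 + size, cy0 + size
            near_edge = False
            for k in range(n):
                (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % n]
                # Conservative check: does the edge's bounding box overlap the cell?
                if (min(x1, x2) <= cx1 and max(x1, x2) >= cx0
                        and min(y1, y2) <= cy1 and max(y1, y2) >= cy0):
                    near_edge = True
                    break
            if not near_edge:
                # No boundary touches the cell, so its center decides for the whole cell.
                waffles[(i, j)] = point_in_polygon((cx0 + cx1) / 2, (cy0 + cy1) / 2, poly)
    return waffles

def locate(x, y, poly, waffles, x0, y0, size):
    """Answer from the in-memory table when possible; fall back to the exact test near borders."""
    key = (floor((x - x0) / size), floor((y - y0) / size))
    if key in waffles:
        return waffles[key]
    return point_in_polygon(x, y, poly)

# Hypothetical region and grid, for illustration only.
POLY = [(0, 0), (4, 0), (4, 2), (2, 4), (0, 4)]
WAFFLES = build_waffles(POLY, x0=-1.0, y0=-1.0, size=0.5, nx=12, ny=12)
```

Points that fall in a classified cell cost one dictionary lookup; only points in unclassified cells (or outside the grid) reach the exact test, which mirrors the division of labor the abstract describes.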



Reporting and Graphics

Paper Authors Title Abstract
Paper 23 Hengwei Liu Side by Side Display of Table and Plot, Plot and Plot Programmers sometimes get requests to display a table and a plot side by side or to display two plots side by side. There are many tools and many ways to do this. ODS LAYOUT in SAS® can display a table and a plot side by side; PROC SGPANEL or the Graph Template Language in SAS can display two plots side by side. R Shiny can also perform these tasks. To display a table and a plot side by side, you can specify two columns in the ui part of the R Shiny program, one column for the table and the other for the plot. The R package gridExtra can display two plots side by side. These different methods are discussed in this paper.
Paper 38 Richann Watson and Louise Hadden "Bored"-Room Buster Bingo - Create Bingo Cards Using SAS® ODS Graphics Let's admit it! We have all been on a conference call that just ... well to be honest, it was just bad. Your misery could be caused by any number of reasons - or multiple reasons! The audio quality was bad, the conversation got sidetracked and the focus of the meeting was no longer what was intended, there could have been too much background noise, someone hasn't muted their laptop and is breathing heavily - the list goes on ad nauseam. Regardless of why the conference call is less than satisfactory, you want it to end, but professional etiquette demands that you remain on the call. We have the answer - SAS®-generated Conference Call Bingo! Not only is Conference Call Bingo entertaining, but it also keeps you focused on the conversation and enables you to obtain the pertinent information the conference call may offer. This paper and presentation introduce a method of using SAS to create custom Conference Call Bingo cards, moving through brainstorming and collecting entries for Bingo cards, random selection of items, and the production of bingo cards using SAS reporting techniques and the Graph Template Language (GTL). (You are on your own for the chips and additional entries based on your own painful experiences!) The information presented is appropriate for all levels of SAS programming and all industries.
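The "random selection of items" step in Paper 38 can be sketched in a few lines of Python. This is an illustration rather than the authors' SAS/GTL approach; the function name, the 5 x 5 layout, and the center "FREE" square are assumptions borrowed from conventional bingo cards, not details taken from the paper.

```python
import random

def make_bingo_card(entries, size=5, seed=None, free_space="FREE"):
    """Return a size x size card (list of rows) drawn at random without repeats.

    Assumes an odd size so that the free square sits exactly in the center.
    """
    unique = sorted(set(entries))          # de-duplicate; sort so a seed is reproducible
    needed = size * size - 1               # every square except the free one
    if len(unique) < needed:
        raise ValueError(f"need at least {needed} distinct entries, got {len(unique)}")
    rng = random.Random(seed)
    picks = rng.sample(unique, needed)     # sampling without replacement
    picks.insert((size * size) // 2, free_space)
    return [picks[r * size:(r + 1) * size] for r in range(size)]

# Hypothetical entries, for illustration only.
ENTRIES = [f"Annoyance {i}" for i in range(1, 41)]
CARD = make_bingo_card(ENTRIES, seed=2021)
```

Passing a different seed per attendee gives each person a distinct card while keeping every card reproducible.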
Paper 53 Sandeep Srivatsav Gangaraju and Kiran Venna Role of various factors impacting customer acquisition/ retention in a product-based organization using SAS. Customer acquisition/retention is one of the key performance metrics of any organization. Important factors impacting customer acquisition/retention are the channel of acquisition, product life cycle, understanding of segments and markets. This paper presents a case study of a fictitious company and shows how customer acquisition and retention are dependent on the above factors. With the help of various graphs, we will explain how each factor impacts customer acquisition/retention.
Paper 59 Louise Hadden Dressing Up your SAS/GRAPH and SG Procedural Output with Templates, Attributes and Annotation Enhancing output from SAS/GRAPH® has been the subject of many a SAS® paper over the years, including my own and those written with co-authors. The more recent graphic output from PROC SGPLOT and the recently released PROC SGMAP is often "camera-ready" without any user intervention, but occasionally there is a need for additional customization. SAS/GRAPH is a separate SAS product for which a specific license is required, and newer SAS maps (GfK Geomarketing) are available with a SAS/GRAPH license. In the past, along with SAS/GRAPH maps, all mapping procedures associated with SAS/GRAPH were only available to those with a SAS/GRAPH license. As of SAS 9.4 M6, all relevant mapping procedures have been made available in BASE SAS, which is a rich resource for SAS users, and in SAS 9.4 M7, further enhancements were provided. This paper and presentation will explore new opportunities within BASE SAS for creating remarkable graphic output, and compare and contrast techniques in both SAS/GRAPH such as PROC TEMPLATE, PROC GREPLAY, PROC SGRENDER, and GTL, SAS-provided annotation macros and the concept of "ATTRS" in SG procedures.
Paper 66 Dennis Beal Tips for Customizing Graphs Using Real Coronavirus Testing Data SAS® has many ways to generate high quality statistical graphics, ranging from the older SAS/GRAPH® module to the latest SG plots using the Output Delivery System (ODS). Often your clients may request very specific customizations to their plots that are not easily handled by simply changing an existing option within SAS. The annotation facility can be a powerful tool to customize your graphs. This paper shows examples of customized graphs using macros and the annotation facility on real publicly available coronavirus testing data. SAS code that generates the graphs is provided and discussed. This paper is for beginning or intermediate SAS users of Base SAS and SAS/GRAPH.
Paper 77 Jim Blum and Jonathan Duggins Getting Started with Attribute Maps: Methods for Creating and Storing Custom Style Definitions for Graphs, Charts, and Maps In this paper, methods for using attribute maps to control styles for graphs are discussed. We begin with the definition of an attribute map and a few simple examples to contrast setting styles via the map to setting them directly in the chosen ODS Graphics procedure. Both discrete attribute maps and range attribute maps are covered (note: range attribute maps are only available for SAS 9.4M3 and later releases). Advanced examples include defining multiple attribute maps in a single data set and using multiple attribute maps in the same graph. Strategies for making efficient, general use of attribute maps are presented. Attendees should have a decent working knowledge of the SGPLOT Procedure to get the most out of this presentation.



Statistics and Data Analysis

Paper Authors Title Abstract
Paper 3 Kannan Deivasigamani and Douglas Lunsford Statistical Test Selector for Researchers This SESUG paper demonstrates how a SAS® macro can be used to programmatically select a suitable statistical test for a given scenario. In their book on research methodology for postgraduates in a medical setting, Gitanjali, Manikandan, and Raveendran (2014) present guidance for choosing the appropriate statistical test for a research study (pp. 90-91). While the book does not cover every statistical test, it contains the required information on most of the commonly used statistical tests and the criteria used to select them. The book is the seed for this technical SAS paper, which is intended to serve as an easy and quick reference macro, mostly for budding researchers, students, and others involved in statistical testing of research data with SAS. As SAS is a commonly used statistical software package, it made sense to the author to develop a macro that accepts the different parameters required to decide on the appropriate test for a specific design and setup.
Paper 6 Jenhao Cheng Converting Remeasurement Data into Percentile Ranks Based on Baseline Data Using PROC SQL: Patient Experience Measures for CMS Value-Based Purchasing Program As postulated by the Centers for Medicare and Medicaid Services (CMS) for hospital quality incentive, the Value-Based Purchasing (VBP) program started its first payment adjustment in FY 2013 by evaluating both clinical quality and patient experience measures. As required by the methodology, patient experience data in the performance period (remeasurement) should be converted into percentile ranks based on the baseline period distribution. In this study, overall rating scores of the last patient experience measure from 3,765 hospitals were analyzed. This paper presents two SQL-based distribution-free methods that do not rely on assumptions about the underlying distribution and can be easily explained to users and implemented in a production tool. The first method is to repeatedly rank each remeasurement data point within the entire baseline distribution and then obtain the percentile ranks by dividing these ranks by the sample size. The second method involves two steps, with a percentile lookup table created first based on the baseline distribution and then the remeasurement data mapped to the lookup table. Both methods can be accomplished in SAS by using Proc SQL, as it is a powerful tool to deal with rank-based analytics and to merge different datasets (tables), where full cross join, unequal join and nearest join are the possible options. The first method (one step and N x N dimension) is overall efficient and most accurate if N is sufficient and the data distribution is smooth without too many ties. When the data lack smoothness due to moderate N or discrete behavior, or when computational efficiency is a concern due to extremely large N, the second method (two steps and N x 100 dimension) is a viable alternative where more control over smoothness is possible by specifying the percentile definition in the first step. In this study both methods lead to very similar results, except that some hospitals are one position lower in percentile rank by method 2 due to a mild rounding issue. In addition to Proc SQL, Proc Univariate and Data Step were also used for the second method, and all the analyses were conducted in SAS® 9.4 Software.
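The two approaches described in Paper 6 can be sketched in Python for readers who want to see the logic without the SQL joins. This is a rough illustration under stated assumptions, not the paper's implementation: the rank of a value is taken to be the count of baseline values less than or equal to it, and the lookup table uses the "inclusive" percentile definition from Python's `statistics.quantiles`; the paper's exact rank and percentile definitions are not given in the abstract.

```python
from bisect import bisect_right
from statistics import quantiles

def percentile_rank_direct(value, baseline_sorted):
    """Method 1 (sketch): rank the value within the full baseline, then divide by N."""
    rank = bisect_right(baseline_sorted, value)   # baseline values <= value
    return 100.0 * rank / len(baseline_sorted)

def build_lookup(baseline):
    """Method 2, step 1 (sketch): 99 cut points P1..P99 from the baseline distribution."""
    return quantiles(baseline, n=100, method="inclusive")

def percentile_rank_lookup(value, cutpoints):
    """Method 2, step 2 (sketch): highest percentile whose cut point is <= value (0..99)."""
    return bisect_right(cutpoints, value)

# Hypothetical baseline scores, for illustration only.
BASELINE = sorted(range(1, 101))
CUTPOINTS = build_lookup(BASELINE)
```

Method 1 compares every remeasurement value against all N baseline values, while method 2 compares it against a fixed table of cut points, which is the N x N versus N x 100 trade-off the abstract mentions. The two can differ slightly for the same value because the lookup table depends on the chosen percentile definition.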
Paper 13 Stephen Sloan and Kevin Gillette Assigning agents to districts under multiple constraints using PROC CLP The Challenge: assigning outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones, in order to maximize productivity and availability. Our Solution: uses the SAS/OR ® procedure PROC CLP to formulate the challenge as a constraint satisfaction problem (CSP), since the objective is not necessarily to minimize a cost function, but rather to find a feasible solution to the constraint set. The input consists of the number of agents, the number of districts, the number of leads in each district, the desired number of leads per agent, the amount by which the actual number of leads can differ from the desired number, and the time zone for each district.
Paper 28 Deanna Schreiber-Gregory Back to Basics: Running an Analysis from Data to Refinement in SAS Data Science is the new Space Race, launching us into a world of immeasurable possibility, but with only a few people to help us navigate it. As we dig deeper, discover more, and risk more, we can be simultaneously led to both great insight and loss. If we do not know what we are doing, Data Science can be a very dangerous thing. It is important for us all to learn at least a little bit about the possibilities and risks of this field of study, so we can navigate it together. This paper was written to give individuals new to SAS® and/or Analytics a gentle nudge in the direction of the possibilities available through Data Science and SAS. It is designed to help you navigate through the process of data exploration by using publicly available COVID 19 data. We have all seen how fragile this data reporting can be, and this paper uses this fragility to help explain the dangers of an inappropriately implemented analytic process. Together, we will briefly touch on current best practices and common errors that occur at the different steps of the analytic process (choosing data, exploring data, building and running a model, checking and refining model performance) while simultaneously reviewing common SAS procedures used in each of these steps (Data Step, Univariate Procedures, Multivariate Procedures, Power & Model Fit Procedures). At the end of this paper, the author provides several citations and recommended readings to help interested analysts further their education in Data Science implementation. Data is everywhere and understanding data science is a growing necessity for navigating today's world. This paper is meant to help give individuals a snapshot of insight into the vastness of possibility that is Data Science.
Paper 51 Austin Brown A Macro to Utilize a Nonparametric Multiple Stream Process Quality Control Chart in SAS Statistical process control charts have been shown to be useful tools in monitoring and improving the quality of a variety of processes over the past century. For a control charting scheme to be successful, it should be chosen to match the specifications of the process to be monitored. For example, there may be instances when several processes that are assumed to be identical need to be monitored simultaneously rather than independently. Such a process is commonly referred to as a "multiple stream process (MSP)." Control charts designed for MSP monitoring have typically assumed that the observations being monitored follow a Normal distribution. This assumption may not always be met, in which case the existing charts become inefficient at detecting anomalies in the process being monitored. Recently, a new control chart was developed for monitoring MSPs that is nonparametric in nature, which means that its performance will remain consistent regardless of the underlying distribution. This chart is called the "Nonparametric Extended Median Test Cumulative Summation Chart (NEMT-CUSUM)." However, one issue with this control chart is that there is no current procedure for utilizing the technique in SAS software. Thus, the purpose of this paper is to develop a SAS macro function to use the NEMT-CUSUM control chart in SAS. Examples and discussion will be provided.
Paper 55 Peter Flom An Introduction to Classification and Regression Trees with PROC HPSPLIT Classification and regression trees are extremely intuitive to read and can offer insights into the relationships among the IVs and the DV that are hard to capture in other methods. I will introduce these methods and illustrate their use with PROC HPSPLIT.
Paper 58 Chun Du Use SAS Enterprise Miner Workstation 15.1 to Do Predictive Analysis for Mobile Strategy Games Industry This study aims to predict mobile strategy games' ratings, to find the relationships between a game's factors and its rating, and to help game developers, game players, and game companies define a successful game. Game ratings are divided into two groups: group 1, rating below 4 (bad performance), and group 2, rating 4 and above (good performance). The study plan divides the sample into 70/30 training and validation sets. Logistic regression, decision tree, and neural network models are used to build predictive models. Text cluster analysis and topic analysis are conducted on game descriptions to find the traits and categories of strategy mobile games with ratings above 4. The final predictive model, a neural network, is the winning model, with a sensitivity for predicting game rating of 97.91% and a total misclassification rate of 25.87%.
Paper 67 Tamar Roomian A SAS macro program to calculate the Fragility Index Fisher's Exact Test is a statistical test used to determine the statistical significance of a 2x2 contingency table when the sample size in any of the cells is <5. In comparative research healthcare studies where the outcome of interest is rare, statistical significance can be easily flipped by changing a small number of events. Feinstein first proposed the Fragility Index (FI) as a measure to determine the statistical stability. The Fragility Index is the number of outcomes required to reverse statistical significance and can be used to assess the statistical stability of studies with rare outcomes. To our knowledge, there is currently no existing SAS macro program that can calculate the FI and FQ. We have created a SAS macro program that calculates the number of event switches required to flip a 2-sided Fisher's exact test for a list of studies. Researchers can then calculate descriptive statistics on the FI to assess statistical stability within a field of study. This presentation will introduce the fragility index, walk through the logic of the macro program, and demonstrate with an example.
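The logic behind the Fragility Index in Paper 67 can be sketched in Python using only the standard library. This is not the author's macro: the two-sided Fisher's exact p-value is computed by summing hypergeometric probabilities no larger than the observed one, and events are added one at a time to the group with fewer events, which is one common convention and an assumption here. The function names and the `alpha` parameter are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def fragility_index(events1, n1, events2, n2, alpha=0.05):
    """Number of non-event-to-event switches needed to lose significance.

    Returns 0 if the result is not significant to begin with, and None if
    significance cannot be reversed by switching within the chosen group.
    """
    events, sizes = [events1, events2], [n1, n2]

    def p_value():
        return fisher_exact_two_sided(events[0], sizes[0] - events[0],
                                      events[1], sizes[1] - events[1])

    if p_value() >= alpha:
        return 0
    g = 0 if events[0] <= events[1] else 1   # group with fewer events (assumed convention)
    switches = 0
    while p_value() < alpha:
        if events[g] >= sizes[g]:
            return None
        events[g] += 1
        switches += 1
    return switches
```

Calling `fragility_index` once per study and collecting the results gives the list that descriptive statistics can then be computed on, as the abstract suggests.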
Paper 89 Bruce Lund Screening, Binning, Transforming Predictors for a Generalized Logit Model The generalized logit model is a logistic regression model where the target (or dependent variable) has 3 or more levels, and the levels are unordered. Predictors for the generalized logit model may be NOD (nominal, ordinal, discrete-numeric) where, typically, the number of levels is under 16. Alternatively, predictors may be continuous where the predictor is numeric and has many levels. This paper discusses methods that screen, bin, and transform both NOD and continuous predictors, as preparation for model fitting. These same methods also apply to the cumulative logit model (where the target is ordered). The binning methodology is applied to NOD predictors and generalizes the concept of information value. The method of transforming a continuous predictor is an extension of the function selection procedure (FSP) to the multinomial target. SAS® macros are presented which implement the methods for screening, binning, and transforming. Familiarity with PROC LOGISTIC is assumed.
Paper 90 Jason Brinkley Using PROC SURVEYSELECT to create data files with all pairwise combinations of data PROC SURVEYSELECT provides an easy mechanism to create datafiles that are all pairwise combinations of two observations in a dataset. While the procedure is traditionally used for creating subsamples of data, there are options that allow one to use the entire data. This paper illustrates how to take an address based dataset and create a new dataset that has all pairwise combinations of each set of addresses so that additional analyses can be done.
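For comparison with the PROC SURVEYSELECT approach in Paper 90, the target output (every unordered pair of records) can be sketched in Python with `itertools.combinations`. This only illustrates the shape of the result, not the SAS technique itself; the address records and field names are hypothetical.

```python
from itertools import combinations

def all_pairs(records):
    """Return every unordered pair of distinct records: n * (n - 1) / 2 pairs in total."""
    return list(combinations(records, 2))

# Hypothetical address records, for illustration only.
ADDRESSES = [
    {"id": 1, "address": "100 Main St"},
    {"id": 2, "address": "200 Oak Ave"},
    {"id": 3, "address": "300 Pine Rd"},
    {"id": 4, "address": "400 Elm Ct"},
]
PAIRS = all_pairs(ADDRESSES)
```

Each pair can then feed a follow-up calculation, such as a distance between the two addresses, without ever pairing a record with itself or listing the same pair twice.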



e-Posters

Paper Authors Title Abstract
Paper 4 Lauren Rackley Effectively Searching for Resources for a Newer SAS User Newer SAS programmers get stuck with programming more often than experienced SAS users. This paper will discuss several resources that newer SAS users can utilize for the times when they get stuck with complicated programming tasks or tasks that they are unfamiliar with. For example, someone with limited statistical knowledge may still be asked to create a Kaplan-Meier plot and by searching Google, several SUG papers come up that outline how to customize Kaplan-Meier plots. With abundant resources, new SAS programmers can quickly enhance their knowledge and improve their SAS programming abilities.
Paper 5 Abbas Tavakoli and Navid Tavakoli Using QUANTREG to Examine time and Drug on Histamine HA level in Mice Quantile regression models the conditional quantiles of the response variable at different percentiles. Quantile regression is useful when the rate of change in the conditional quantile depends on the quantile. Flexibility for modeling data with heterogeneous conditional distributions is one advantage of quantile regression over the ordinary regression model. Quantile regression can be used in many fields, including biomedicine, econometrics, and ecology. The SAS QUANTREG procedure is used to perform regression analysis when the assumptions of ordinary regression are not met. The QUANTREG procedure uses quantile regression to model the effects of covariates on the conditional quantiles of a response variable. SAS provides practical and efficient ways to analyze different types of data with heterogeneous conditional distributions. The purpose of this paper is to examine whether the slope of time on Histamine (HA) level in mice differs by drug (desipramine) as compared to control. All of the slopes were significant (P<.001) except the time slope at .75 (P=.864). The results indicated that group was not significant for quantile level .05 (P=.426) and quantile level .85 (P=.458). In addition, time was not significant for quantile levels .75 and .80 (P=.864 and .119), and the interaction was not significant for quantile level .05 (P=.242). SAS is a powerful statistical program for carrying out complex statistical procedures.
Paper 21 Hengwei Liu Some Linux Shell Scripts for SAS® Programmers Many pharmaceutical companies use SAS on Linux servers. Linux shell scripting is a powerful tool for SAS programmers. It can be used to read text files, extract information, and perform file operations. This paper discusses some Linux shell scripts of interest to SAS programmers.
Paper 32 Kirk Paul Lafler Ten Rules for Better Charts, Figures and Visuals The production of charts, figures and visuals should follow a process of displaying data in the best way possible. However, this process is far from direct or automatic. There are so many different ways to represent the same data: histograms, scatter plots, bar charts, and pie charts, to name just a few. Furthermore, the same data, using the same type of plot, may be perceived very differently depending on who is looking at the figure. A more inclusive definition for the production of charts, figures and visuals would be a graphical interface between people and data. This presentation highlights the work of Nicolas P. Rougier, Michael Droettboom, and Philip E. Bourne by sharing ten rules to improve the production of charts, figures and visuals.
Paper 54 Imelda Go Confirming Data Redundancy or Inconsistency in SAS® This paper goes over a quality control example on how to confirm data redundancy and identify data inconsistencies when the expectation is data redundancy/consistency. PROC MEANS is used to diagnose data redundancy/inconsistency by generating an output data set that can be presented meaningfully with PROC TABULATE. For example, a test item appears on different test forms. Its item meta data elements are expected to be identical across test forms. How can you easily confirm that the item meta data are identical across all test forms and how can you identify inconsistencies in a user-friendly report to facilitate the identification and correction of the inconsistencies?
Paper 61 Louise Hadden Looking for the Missing(ness) Piece Reporting on missing and/or non-response data is of paramount importance when working with longitudinal surveillance, laboratory and medical record data. Reshaping the data over time to produce such statistics is a tried and true technique, but for a quick initial look at data files for problem areas, there's an easier way. This quick tip will speed up your data cleaning reconnaissance and help you find your missing(ness) piece. Additional tips on making true missingness easy to identify are included.
Paper 74 Stephen Sloan A Quick Look at Fuzzy Matching Programming Techniques Using SAS® Software Data comes in all forms, shapes, sizes and complexities. SAS® users across industries know all too well that data stored in files and data sets can be, and often is, problematic and plagued with a variety of issues. Two data files can be joined without a problem when they have identifiers with unique values. However, many files do not have unique identifiers, or "keys", and need to be joined by character values, like names or E-mail addresses. These identifiers might be spelled differently, or use different abbreviation or capitalization protocols. This paper illustrates data sets containing a sampling of data issues; popular data cleaning and user-defined validation techniques; data transformation techniques; traditional merge and join techniques; an introduction to SAS character-handling functions for phonetic and fuzzy matching, including SOUNDEX, SPEDIS, COMPLEV, and COMPGED; and an assortment of SAS programming techniques to resolve key identifier issues and to successfully merge, join and match less than perfect, or "messy", data. Although the programming techniques are illustrated using SAS code, many, if not most, of the techniques can be applied to any software platform that supports character handling.