SESUG 2013 Conference Abstracts
Back to Basics
Ten Things You Should Know About PROC FORMAT
The SAS® system shares many features with other programming languages and reporting packages. The programming logic found in the ubiquitous DATA step provides the mechanisms for assignment, iteration, and logical branching which rest at the core of any procedural language. Analytic data displays, like the humble frequency cross-tabulation produced by PROC FREQ or PROC REPORT, may be replicated with varying degrees of success using any number of other products. PROC FORMAT is another matter.
Somewhat like an enumerated data type, somewhat like a normalized and indexed reference table, it has no exact analog in these other products and packages. There’s a lot you can do with PROC FORMAT. And, there’s a lot to know about PROC FORMAT. The aim of this paper is to provide insight on at least ten of those things which you should know.
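As a brief illustration of the kind of thing the paper covers – a user-defined format that bins a numeric variable and then drives the grouping in PROC FREQ (the cutpoints and dataset here are invented for the example, not taken from the paper):

```sas
proc format;
  value agegrp              /* map numeric ranges to labels */
    low -< 13 = 'Child'
    13  -< 20 = 'Teen'
    20 - high = 'Adult';
run;

proc freq data=sashelp.class;
  tables age;
  format age agegrp.;       /* FREQ tabulates the formatted groups */
run;
```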
Tailoring Proc Summary for More Efficient Summarizations
When using Proc Summary many who are new to SAS programming sort their data and then summarize. Although there are many summarization techniques possible with Proc Summary, the objective of this paper is the presentation of a fundamental technique, showing how to eliminate the perceived need to pre-sort Proc Summary input data sets and how to tailor Proc Summary to produce only the exact summaries required.
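A minimal sketch of the core idea – a CLASS statement (rather than BY) removes the perceived need to pre-sort, and NWAY restricts the output to exactly the summary requested (the variable choices are illustrative only):

```sas
proc summary data=sashelp.class nway;   /* no PROC SORT beforehand      */
  class sex;                            /* CLASS handles unsorted data  */
  var height weight;
  output out=stats mean= / autoname;    /* only the SEX-level summary   */
run;
```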
Basketball Analytics: Optimizing the Official Basketball Box-Score (Play-by-Play)
What if basketball analytics could formulate an “end-all” value that could justly evaluate team and/or player performance?
Those immersed in the world of basketball analytics are challenged with this mission: to translate a game of interdependent factors into simple measures of player and team performance. The Official Box-Score (Play-by-Play), for example, serves as an invaluable guide to understanding the game at a fundamental level. With unbiased precision, the Official Box-Score (Play-by-Play) displays descriptive measurements used to show how well or how poorly a player and/or team has performed. Using advanced statistical software tools, I have extended the Official Box-Score and established an extensible framework that can measure the offensive and defensive prowess of a basketball team, by lineup (overall and per game). Relative to these measurements of frequency, efficiency, and precision, I will discuss the framework's methodology and display examples of output for the evaluation of player and/or team lineup performance.
Though basketball analytics comes with its limitations and imperfections, the pursuit of the advancement of knowledge of the game further incites ongoing analyses and a penchant for better statistics!
A Few Useful Tips When Working with SAS®
Working hard to meet deadlines, we often stick to old, proven solutions. But SAS is living software, and with each release we should learn new features that add efficiency and make our working lives easier. Improvements in PROC SORT, combinations of old and new SAS functions, and doing more in fewer DATA or PROC steps improve the performance, readability, and, frequently, the maintainability of SAS programs. This paper deals with matching data, checking for duplicates, creating more compact Excel files, and validating data in non-SAS output files.
Help! I Need to Report a Crime! Why is PROC REPORT So Hard to Use?
Business analysts often need to create flexible summarized reports for large amounts of data and attributes that may be easily enhanced or changed based on business needs. Using PROC REPORT can replace cumbersome manual reporting and automatically generate sub-totals, percentages and summaries for reports as well as allow formatting for areas of interest within a report. This paper discusses how to prepare your data and to programmatically generate reports using PROC REPORT that contain information you would expect of any report including titles, footnotes, data and column formatting, percentages, averages and other statistical measures, sub-totals and grand totals. Let PROC REPORT be your partner in crime instead of criminally confusing!
Repetitive Tasks and Dynamic Lists: Where to Find What You Need and How to Use It
Often, the same operation needs to be performed for a number of items, and the number and value of those items might not be known until it is time for a program to be run. Print out the contents of an entire set of datasets within a library, generate a narrative file for each subject in a study, read in every tab from a set of Excel spreadsheets – requests for these tasks (or something similar) come up frequently.
These repetitive tasks can be handled by looping across a list of items, but sometimes, it is not clear how to populate that list. This paper discusses methods for generating and managing such a list dynamically, using the list within a simple macro loop, as well as how to access information within SAS to find some common elements – data sets within a library, variables within a data set, files within a directory, and tabs within an Excel spreadsheet.
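One common shape this takes – offered here as a hedged sketch, not the paper's own code – is a PROC SQL query against the DICTIONARY tables to build the list at run time, followed by a simple %SCAN-driven macro loop:

```sas
proc sql noprint;                  /* build the list at run time     */
  select memname into :dslist separated by ' '
  from dictionary.tables
  where libname = 'SASHELP' and memtype = 'DATA';
quit;
%let dscount = &sqlobs;            /* how many items were found      */

%macro print_all;
  %do i = 1 %to &dscount;
    %let ds = %scan(&dslist, &i);
    proc print data=sashelp.&ds(obs=5); run;
  %end;
%mend print_all;
%print_all
```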
Chi-Square and T-Tests Using SAS®: Performance and Interpretation
Jennifer Waller and Maribeth Johnson
Data analysis begins with data clean up, calculation of descriptive statistics and the examination of variable distributions.
Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests such as chi-square and t-tests to assess unadjusted associations. These tests help to guide the direction of the more rigorous analysis. How to perform chi-square and t-tests will be presented. We will examine how to interpret the output, where to look for the association or difference based on the hypothesis being tested, and propose next steps for further analysis using example data.
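The two tests discussed can be requested with very little syntax; as a minimal sketch (the SASHELP.HEART variables are stand-ins for study data):

```sas
proc freq data=sashelp.heart;       /* chi-square test of association */
  tables sex*status / chisq;
run;

proc ttest data=sashelp.heart;      /* two-sample t-test by group     */
  class sex;
  var cholesterol;
run;
```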
Transitioning from Batch and Interactive SAS to SAS Enterprise Guide
Although the need for access to data and analytical answers remains the same, the way we get from here to there is changing. Change is not always adopted nor welcomed and it is not always voluntary. This paper intends to discuss the details and strategies for making the transition from traditional SAS software usage to SAS Enterprise Guide. These details and strategies will be discussed from the company as well as the individual SAS developer level. The audience for this paper should be traditional SAS coders who are just getting exposed to SAS Enterprise Guide but still want to write code.
A Quick and Gentle Introduction to PROC SQL
Shane Rosanbalm and Sam Gillett
If you are afraid of SQL, it is most likely because you haven’t been properly introduced. Maybe you’ve heard of SQL but never quite had the time to explore. Or, maybe you were introduced to SQL, but it was at the hands of a database administrator turned SAS programmer who liked to write un-indented queries that were 60 lines long and combined 15 different datasets; egads! The goal of this paper is to provide an introduction to PROC SQL that is both quick and gentle. We will not discuss joins, subqueries, cases, calculated, feedback, return codes, or anything else that might dampen your initial momentum. Don’t get me wrong, these are all wonderful topics. But they are better suited to a paper in the Hands-on Workshop or Beyond the Basics sections. Our focus will be limited to introductory topics such as select, from, into, where, group by, and order by. Once these basic topics have been introduced, we will use them to demonstrate some simple yet powerful applications of SQL. The applications will focus on situations in which the use of SQL will result in code that is both more efficient and more concise than what can be achieved when limiting oneself to the DATA step and other non-SQL procedures.
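A gentle taste of those clauses in one place (SASHELP.CARS is used here purely as a convenient example table):

```sas
proc sql;
  select origin,                     /* SELECT ... FROM ... WHERE      */
         count(*)  as n,
         avg(msrp) as avg_price
  from sashelp.cars
  where type ne 'Hybrid'
  group by origin                    /* GROUP BY ... ORDER BY          */
  order by avg_price desc;

  select avg(msrp) into :overall     /* INTO: result -> macro variable */
  from sashelp.cars;
quit;
%put Overall average MSRP: &overall;
```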
Lost in Translation: A Statistician’s (Basic) Perspective of PROC LIFETEST
Survival analysis is a main-stay in clinical trials. It is the bread and butter of oncology trials. It can also be the most frustrating part of the statistical programmer’s day. There can be a huge disconnect between the language of a biostatistician and a statistical programmer. The phrases “Survival Analysis”, “Time to Event Analysis”, and “PROC LIFETEST” can mean the same thing, or very different things. In this article, I provide a basic understanding of the reasoning and method behind PROC LIFETEST as well as a “statistician translator”. Using example data and example code, I compare what the programmer sees to what the statistician would describe. Output such as .lst summaries, datasets, and plots gives the opportunity to explore what PROC LIFETEST can do and how to interpret the results. Tips for avoiding errors, choosing the best method, and utilizing the output datasets should make your next use of PROC LIFETEST more enjoyable and rewarding.
Going from Zero to Report Ready with PROC TABULATE
The TABULATE procedure in SAS can be used to summarize your data into clean and organized tables. This procedure can calculate many of the descriptive statistics that the MEANS, FREQ, and REPORT procedures do, but with the flexibility to display them in a customized tabulated format. At first, the syntax may seem difficult and overwhelming but with practice and some basic examples you can go from zero to report ready in no time. This paper will discuss the benefits of using PROC TABULATE and identify the kinds of reports for which this procedure is best suited. An example will be used to illustrate the syntax and statements needed to generate a complex table. The table will include multiple classification variables as well as more than one numeric variable for various computed statistics. Readers will learn the functions of the CLASS, VAR and TABLE statements and how to include subtotals and totals with the keyword ALL. To make the finished table ‘report ready’, examples of how to apply formats, labels and styles will also be shared.
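As a small hedged sketch of the CLASS/VAR/TABLE/ALL interplay described above (variables chosen only for illustration):

```sas
proc tabulate data=sashelp.cars format=8.1;
  class origin type;                     /* classification variables */
  var msrp;                              /* analysis variable        */
  table origin all,                      /* rows, plus ALL subtotal  */
        type*msrp*(n mean) all*msrp*mean;
run;
```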
Submitting SAS® Code on the Side
This paper explains the new DOSUBL function and how it can submit SAS® code to run “on the side” while your DATA step is still running. It also explains how this function differs from invoking CALL EXECUTE or invoking the RUN_COMPILE function of FCMP. Several examples are shown that introduce new ways of writing SAS code.
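A minimal sketch of the behavior (one plausible pattern, not taken from the paper): DOSUBL submits a complete step in a side session, and macro variables created there are visible back in the still-running DATA step.

```sas
data _null_;
  /* DOSUBL runs complete SAS code "on the side" and returns   */
  /* control (rc = 0 on success) before this step finishes     */
  rc = dosubl('
    proc sql noprint;
      select avg(height) into :avg_ht from sashelp.class;
    quit;
  ');
  avg = input(symget('avg_ht'), best12.);  /* created on the side */
  put avg=;
run;
```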
Data Entry in SAS® Strategy Management: A New, Better User (and Manager) Experience
Data entry in SAS® Strategy Management has never been an especially pleasant task given the outdated user interface, lack of data validation and limited workflow management. However, release 5.4 unveils a complete overhaul of this system. The sleeker HTML5-based appearance provides a more modern Web experience. You are now able to create custom data validation rules and attach a rule to each data value. Discover how form workflow is supported via SAS Workflow Studio and the SAS Workflow Services.
Looking for an Event in a Haystack?
Maya Barton and Rita Slater
How do you identify where a specific event occurred during a visit period? Or whether a recorded event actually occurred during a particular visit? Concomitant Medications (CMs) and Adverse Events (AEs) typically form a long list of events, that is, a dataset with each event's description, start date, and end date. These events also typically appear on visit forms as a data point marking the occurrence of the event. A common programming task involves displaying any inconsistencies in a report for the data manager's perusal, such that he or she can query the site in question and clean the data.
This paper details the generation of such a report by building a dataset of the inconsistencies. In examining these inconsistencies, four case report forms (CRFs) come into play: a physical visit form (VIST), a telephone visit form (TELE), a CM form (CM), and an AE form (AE). Through a series of sorts, formats, and joins, this paper explains how these CRFs can be reconciled with one another to form a presentable dataset of inconsistencies.
Beyond the Basics
5 Simple Steps to Improve the Performance of your Clinical Trials Table Programs using Base SAS® Software
Performance is defined in the Encarta® World English Dictionary as “manner of functioning: the manner in which something or somebody functions, operates, or behaves”.
In order to improve the performance of a SAS program, we need to look at several factors. These include not only the processing speed of the SAS program but also the readability and understandability of the program so that it can be edited efficiently.
Most clinical trials tables involve taking several SAS data sets, which can be rather large, combining them and then producing statistics ranging from simple counts and means, using PROC FREQ or PROC MEANS, to more complex statistical analysis. There are a few practices that each programmer can embrace to make these programs more efficient, thereby increasing the performance of these programs.
This paper will detail these steps, with examples, so that you can add them to your SAS tool bag for your next set of project outputs.
Data Review Information: N-Levels or Cardinality Ratio
This paper reviews the database concept of Cardinality Ratio. The SAS® frequency procedure can produce an output data set with a list of the values of a variable. The number of observations of that data set is called N-Levels. The quotient of N-Levels divided by the number of observations of the data is the variable’s Cardinality Ratio (CR). Its range is (0, 1].
Cardinality Ratio provides an important value during data review. Four groups of values are examined.
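One hypothetical way to compute the ratio directly (note that COUNT(DISTINCT ...) ignores missing values, which may differ slightly from an N-Levels count that treats missing as a level):

```sas
/* Cardinality Ratio = distinct levels / observation count */
proc sql;
  select count(distinct type)  / count(*) as cr_type  format=6.4,
         count(distinct model) / count(*) as cr_model format=6.4
  from sashelp.cars;
quit;
```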
Data in the Doughnut Hole: Using SAS® to Report on What is NOT There
Typically the request for a table, graph or other report concerns data that currently exists and needs to be explored. However, there is often a need to examine data that is expected to be present but currently is not. Yet how can we report on something that is not there? This paper describes and explores ways to create simulations of what is expected and then match those with data that is actually present.
Making comparisons between what has been anticipated and what actually exists then opens up the ability to report on what is not there. This paper also explores how to test the differences between the groups based on a variety of conditions and how to determine what is legitimately absent. Reporting on the absent data is then described, including how to check that everything which is expected is actually present and be able to report that as well. DATA step techniques are combined with PROC SQL and PROC REPORT to create a step-wise process which can be easily modified to fit most any set of specifications. Let’s dive into the sweet task of reporting from inside the doughnut hole!
Keywords: shell data, PROC SQL, missing records, PROC REPORT
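A hedged sketch of the shell-data idea with PROC SQL ("subjects", "visits", and "observed" are hypothetical datasets invented for this example):

```sas
proc sql;
  create table expected as             /* the shell: every subject x visit */
  select s.subject, v.visit
  from subjects as s, visits as v;

  create table absent as               /* expected but not present */
  select e.*
  from expected as e
  left join observed as o
    on  e.subject = o.subject
    and e.visit   = o.visit
  where o.subject is null;
quit;
```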
EG and SAS: WORK-ing together
Enterprise Guide and regular SAS users are often two distinct groups. This discussion tries to widen the slice of the Venn diagram representing users of both products. It presents a way for EG users to look at a SAS session's WORK directory, and vice versa. There are often times when it would be helpful for a regular SAS user to hop over to EG for a quick second to, say, look at a SAS data set in MS Excel (a two-click process once EG is up and running). Another example would be to see, in real time, how different formats look on a data value, or perhaps how EG would write a piece of code for a task. Going the other way, an experienced SAS user who is a new EG user might get part-way through an EG task and want to see how a certain piece of code behaves in their normal, comfortable SAS environment. Instead of a best-of-breed philosophy between vendors, this is a best-of-breed discussion within a vendor, using each product, EG and SAS, for the things each does best.
Use SAS® to create equal sized geographical clusters of people
How do you determine where staff should be located to serve your population when that population is spread out across a geographic region?
With Proc Geocode and the SAS® supplied zip code files the tools are available to determine where staff should be located to fit your population. One method has been developed for figuring out how the staff should be distributed. This paper will discuss this method in addition to how to summarize and display the results using SAS® maps.
PEEKing at Roadway Segments
In this practical application of some special SAS functions and CALL routines we control the location in memory of variables to be compared from one observation to the next. Forcing the variables to be written adjacent to one another enables us to treat them as a single variable. We use the special functions PEEKC, ADDR, and CALL POKE, along with a DOW loop.
Reading Data from Microsoft Word Documents: It's Easier Than You Think
Base SAS provides the capability of reading data from a Microsoft Word document, and it’s easy once you know how to do it. Bookmarks and the FILENAME statement with the DDE engine make it unnecessary to export to Excel or work with an XML map. This paper will go through the steps of having SAS read a Word document and share a real-world example demonstrating how easy it is. All levels of SAS users may find this paper useful.
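The paper's real-world example isn't reproduced here, but the general bookmark-plus-DDE pattern looks something like the sketch below (the document path and the bookmark name "results" are hypothetical, and Word must be running with the document open for DDE to connect):

```sas
filename report dde 'winword|"C:\study\summary.doc"!results';

data from_word;
  infile report notab dlm='09'x missover;  /* tab-delimited bookmark text */
  input subject :$10. value;
run;
```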
Your Opinion Counts: Using Twitter to Poll Conference Attendees
Peter Eberhardt and Matt Malczewski
Twitter is a ubiquitous communications tool used to send some interesting (and often not so interesting) messages to friends and followers. The sheer volume of messages provides a veritable gold mine of unstructured text for marketers to mine, and for companies to monitor to ensure any “bad press” is quickly identified and acted upon. In this paper we will show how SAS® can access tweets in a fun interactive audience polling exercise.
Taming a Spreadsheet Importation Monster
As many programmers have learned to their chagrin, it can be easy to read Excel® spreadsheets, but problems can arise when one needs to concatenate the data from two or more sheets. Unlike SAS®, Excel does not require that a column contain data of only one type, so as SAS scans the columns in a spreadsheet, it uses rules to decide the nature of the data in each column. Problems will occur if in one spreadsheet a column is read as character while in other spreadsheets the data are read as numeric. If there are legitimate character data in the column, then the user will have to decide how to handle them. However, if SAS sees too many blank cells at the start of a column, the incoming variable will be in character format. This paper uses information in dictionary tables to generate SAS code that examines all of the columns (variables) in the workbook and determines which are seen in both formats. Then, DATA steps read in the “bad” spreadsheets using a data set option to force the problematic columns to be treated as numeric.
Note that this approach requires that the site has licensed SAS/ACCESS Interface to PC Files and the coding is based on a Windows environment.
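A hedged sketch of the two ingredients – a DICTIONARY.COLUMNS self-join to spot type conflicts, and the DBSASTYPE= data set option to force a column's type (workbook names, sheet name, and the column "amount" are all hypothetical):

```sas
libname xl1 excel "book1.xlsx";
libname xl2 excel "book2.xlsx";

proc sql;                       /* columns whose types disagree */
  select a.name, a.type as type1, b.type as type2
  from dictionary.columns as a, dictionary.columns as b
  where a.libname = 'XL1' and b.libname = 'XL2'
    and upcase(a.name) = upcase(b.name)
    and a.type ne b.type;
quit;

data fixed;                     /* force the problem column to numeric */
  set xl2.'Sheet1$'n(dbsastype=(amount='NUMERIC'));
run;
```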
Small Sample Equating: Best Practices using a SAS Macro
Anna Kurtz and Andrew Dwyer
Test equating is used to make adjustments to test scores to account for differences in the difficulty of test forms. Small-scale testing programs (e.g., highly specialized licensure or certification) often face the reality of needing to perform test equating when sample sizes fall well below recommended numbers, and equating with samples smaller than 100, or even very small samples below 25, is not unheard of in practice. This study used resampling methodology, SAS Macro programming, and data from a large national certification test to examine the accuracy of several small sample equating methods (mean, Tucker, circle-arc, nominal weights mean, and synthetic) along the score distribution. The results of this study indicated that for small samples (i.e., below 50), circle-arc slightly outperformed the other methods, although the difference was minimal and depended on the size of the difference in test form difficulty. Furthermore, we propose that for well-behaved tests where the primary purpose is to make accurate pass/fail classifications of candidates, a sample size of 50 would be sufficient for test equating.
Computing Counts for CONSORT Diagrams: Three Alternatives
It is standard practice to document how the sample used for a study was determined – first showing all subjects considered and then the numbers successively removed from the cohort for various reasons. SAS does not provide a specific facility for generating the tables/diagrams commonly used to present cohort determination information. Recently, Art Carpenter published a paper showing how to create such diagrams given a template RTF file and the counts that pertain to each box in the template. However, his work did not address the question of how to derive the counts. We examine two commonly used approaches and provide a SAS macro to automate a third approach. This third approach has superior reproducibility, eliminates transcription errors, and makes changes to the diagram easier to effect.
The macro we present uses SAS datasets prepared by the user to convey the user’s specifications for the counts to be produced and the logical expressions associated with each count. Thus, the macro is able to compute counts for almost any consort diagram. The technique of enabling the user to supply SAS expressions via a dataset and thereby control program flow is a little used macro programming strategy with potentially broad application.
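The authors' macro itself is not shown here, but the underlying strategy – letting a dataset supply WHERE expressions that drive the program – can be sketched as follows ("specs" with variables label/expr, and "cohort", are hypothetical datasets):

```sas
%macro consort_counts;
  proc sql noprint;
    select label, expr into :lbl1-, :expr1-   /* one pair per box */
    from specs;
  quit;
  %let nbox = &sqlobs;

  %do i = 1 %to &nbox;
    proc sql noprint;
      select count(*) into :n&i trimmed
      from cohort
      where &&expr&i;            /* the dataset-supplied expression */
    quit;
    %put &&lbl&i: &&n&i;
  %end;
%mend consort_counts;
%consort_counts
```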
Averaging Numerous Repeated Measures in SAS Using DO LOOPS and MACROS: A Demonstration Using Dietary Recall Data
Kendra Jones and Kyla Shelton
Background/Objective: Dietary recall data consists of numerous measures recorded across multiple days for a particular individual. The raw measures must be collapsed and averaged before meaningful analysis can begin. Basic SAS programming can calculate the averages, but cannot handle the nearly countless number of variables without knowledge of each variable name and extensive syntax to process each one. The objective of this paper is to demonstrate a widely applicable process for averaging these or similar types of measures. The process uses intermediate programming techniques to eliminate specific variable name programming and reduce syntax.
Methods: The process begins with a macro program to import the raw data files. A DATA step combines and formats the datasets for analysis. PROC CONTENTS with the OUT= option creates an additional dataset containing the names of all numeric variables in the analysis dataset. PROC SQL, with the INTO: keyword, builds a macro variable made up of each variable name separated by a space. The SCAN function is later used to select and process these variables one by one. The CALL SYMPUT routine, within a DATA _NULL_ step, determines the number of records in the variable-list dataset and stores this value as a macro variable. The calculation portion of the program functions within a %DO loop. PROC SQL, with BY-group processing, calculates the mean of each repeated measure/variable. The record count and variable list macro variables direct the do-loop processing. A DATA step with a MERGE statement captures and stores each calculation before processing advances to the next variable.
Results: The complete process yields a single dataset of numerous calculated values suitable for continued application in further analyses.
Conclusions: This method successfully combines multiple intermediate SAS programming components to calculate the average of numerous repeated measures. More importantly, the use of variable list and record count macro variables eliminates the need to program in specific variable names and reduces syntax when the number of variables is large. Therefore, it is a widely applicable process that can be customized to a variety of needs and datasets.
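The skeleton of the process described in the Methods section might look like this hedged sketch (the dataset "analysis" and its key variable "id" are hypothetical stand-ins for the dietary recall data):

```sas
proc contents data=analysis out=vars(keep=name type) noprint; run;

proc sql noprint;                       /* numeric variables only (type=1) */
  select name into :varlist separated by ' '
  from vars where type = 1;
quit;
%let nvars = &sqlobs;

%macro avg_all;
  %do i = 1 %to &nvars;
    %let v = %scan(&varlist, &i);
    proc sql;                           /* one mean per subject            */
      create table m&i as
      select id, avg(&v) as &v
      from analysis group by id;
    quit;
  %end;
  data means;                           /* combine the per-variable means  */
    merge %do i = 1 %to &nvars; m&i %end;;
    by id;
  run;
%mend avg_all;
%avg_all
```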
A Novel Approach to Code Generation for Data Validation and Editing using Excel and SAS
Mai Nguyen, Katherine Mason and Shane Trahan
A common task in the data processing of hardcopy surveys is data validation and editing. For complicated and/or lengthy surveys, this task can be very labor-intensive and error-prone. The challenge was to find a better way to improve the process of specifying validation/editing rules and implementing these rules in a more efficient manner. Validations may include things such as:
- Detecting skip pattern violations
- Detecting missing critical items
- Detecting logical inconsistencies between responses within a questionnaire
Often these types of validations and data cleaning/editing involve writing SAS code. Our team has developed a unique approach which uses Microsoft Excel to generate most of the SAS code from the validation and edit rules specified in the spreadsheet. This technique provides multiple benefits:
- Code is easily edited and updated automatically when the rules change
- Repeating patterns can be quickly observed, so similar rules can be implemented quickly using a “copy and paste” technique
- Better quality control, because the rules and generated code are kept together side-by-side in the spreadsheet
In this paper, we’ll provide the implementation details of the technique and demonstrate it with some code samples. We’ll also provide the pros and cons of this technique and possible enhancements for the future.
Same Data Different Attributes: Cloning Issues with Data Sets
When dealing with data from multiple or unstructured data sources, there can be data set variable attribute conflicts. These conflicts can be cumbersome to deal with when developing code. This paper intends to discuss issues and table driven strategies for dealing with data sets with inconsistent variable attributes.
Finding the Gold in your Data - an Introduction to Data Mining
This is an introductory overview of the major components of SAS® Enterprise Miner. The focus is on explanation of the available methods, the types of data that are appropriate for these analyses, and examples, rather than mathematical developments or software usage instructions. It will be of most value to users who are new to or have never used a data mining tool. Topics include recursive splitting, logistic regression, neural networks, association analysis, and a brief introduction to text mining. This talk was presented as a statistics keynote address at SAS Global Forum 2012.
List Processing With SAS: A Comprehensive Survey
In SAS, a list is not the data structure well known in other general programming languages like Python; rather, it is a series of values that facilitates data-driven programming. This paper takes a comprehensive survey of SAS list processing techniques (function-like macros), from basic operations (create, sort, insert, delete, search, quote, reverse, slice, zip, etc.) to real-life applications. Sources were taken from a variety of papers and online macro repositories by SAS gurus Roland Rashleigh-Berry, Ian Whitlock, Robert J. Morris, Richard A. DeVenezia and Chang Chung. The author also contributes some macros (such as how to slice a list) to fill the holes in this big picture. All code is hosted on github.com (https://github.com/Jiangtang/Programming-SAS/tree/master/ListProcessing), and comments and forks are welcome (don’t reinvent wheels!).
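To give a flavor of the function-like macro style surveyed here (these two toy macros are the editor's illustration, not taken from the repository): a function-like macro emits its result in place, with no DATA step and no semicolons in its body.

```sas
%macro list_len(list);                 /* length of a space-delimited list */
  %sysfunc(countw(&list, %str( )))
%mend list_len;

%macro list_reverse(list);             /* reverse a space-delimited list  */
  %local i out;
  %do i = %list_len(&list) %to 1 %by -1;
    %let out = &out %scan(&list, &i);
  %end;
  &out
%mend list_reverse;

%put %list_reverse(a b c d);           /* writes: d c b a */
```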
Solving Samurai Sudoku Puzzles – A First Attempt
John R Gerlach
Imagine a Sudoku puzzle that consists of a 9x9 matrix having about 30 out of 81 cells populated. Now, imagine five such Sudoku puzzles such that a center puzzle is joined to four others at their respective 3x3 corner sub-matrices. These five Sudoku puzzles define a Samurai Sudoku puzzle. The solution to the puzzle is the same: each puzzle having unique values per row, column, and sub-matrix. However, there is an obvious interdependence among the five puzzles that poses a new challenge. This paper explains an expanded version of a SAS solution that used a dynamic cube to solve a regular Sudoku puzzle, by incorporating five cubes to solve the Samurai puzzle.
Pre-conference Seminar: Advanced Macro Design
Description: This paper provides a set of ideas about design elements of SAS® macros.
Purpose: Checklist for macro programmers.
Audience: Intermediate programmers writing macros.
Big Data in the Warehouse – Quality is King and SAS Can Do It All
The concept of the data warehouse has hit the quarter-century mark. And in many industries the size is hitting the Petabyte mark.
Data warehouses are seen as indispensable in many industries, including Financial Services. The term “Big Data” has no well-defined meaning, but can be applied to many large data stores, including data warehouses. There are some who say that Big Data is a new thing that requires new tools. But it can be shown that SAS is well up to the task of managing and analyzing large data stores. No data warehouse will provide return on investment without ensuring that the data has sufficient quality to meet user demands, and data quality assurance is where SAS can truly shine.
We will look at how a data warehouse is built and maintained, how quality is ensured and how SAS tools provide the ideal platform for managing it all. Specifically, we’ll see how SAS can be the best of all possible ETL tools, how it can supplement and help cross-check other ETL tools and how data quality testing is done. And we’ll look at how SAS fares with really big data.
Binning Bombs When You’re Not a Bomb Maker: A Code-Free Methodology to Standardize, Categorize, and Denormalize Categorical Data
Troy Martin Hughes
Categorical data are as common as the complications that arise when manipulating them throughout Extract Transform Load (ETL) processes. A typical objective is to create a flat file (i.e., a denormalized table) from a single or multiple normalized data sets. Variations in data may make this task daunting if not impossible. Moreover, once a data set has been standardized effectively, analysts may want to prescribe different models upon the data, including a hierarchical structure that defines these categorical data. The analyst writing the SAS code may not be the subject matter expert on the content of the data and, with successive re-writes of code to fulfill the desires of all involved, the resultant code is likely to be a jumbled mess. The proposed macros offer a CODE-FREE process that standardizes categorical fields to remove spelling and other variations, categorizes these fields into a multi-tiered hierarchical structure, and finally denormalizes the resultant data into a flat file for analytic use. Standardization and categorization are prescribed through an XML-like format that non-coders can freely edit and modify without knowledge of SAS; moreover, multiple schemas can be saved and run through, producing varying results that conform to their respective models.
Introducing the New ADAPTIVEREG Procedure for Adaptive Regression
Predicting the future is one of the most basic human desires. In previous centuries, prediction methods included studying the stars, reading tea leaves, and even examining the entrails of animals. Statistical methodology brought more scientific techniques such as linear and generalized linear models, logistic regression, and so on. In this paper, you will learn about multivariate adaptive regression splines (Friedman 1991), a nonparametric technique that combines regression splines and model selection methods. It extends linear models to analyze nonlinear dependencies and produce parsimonious models that do not overfit the data and thus have good predictive power. This paper shows you how to use PROC ADAPTIVEREG (a new SAS/STAT® procedure for multivariate adaptive regression spline models) by presenting a series of examples.
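The procedure's basic invocation is compact; as a minimal hedged sketch (the SASHELP.CARS variables are stand-ins for real modeling data):

```sas
proc adaptivereg data=sashelp.cars;
  model mpg_city = horsepower weight;   /* spline basis chosen adaptively */
run;
```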
ISO 101: A SAS® Guide to International Dating
Peter Eberhardt and Xiao Jin Qin
For most new SAS programmers SAS dates can be confusing. Once some of this
confusion is cleared the programmer may then come across the ISO date formats
in SAS, and another level of confusion sets in. This paper will review SAS
date, SAS datetime, and SAS time variables and some of the ways they can be
managed. It will then turn to the SAS ISO date formats and show how to make
your dates international.
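The flavor of the idea, as a minimal sketch (the E8601DA. and B8601DA. format/informat pairs ship with Base SAS):

```sas
/* ISO 8601 dates in and out of SAS date values */
data iso_demo;
   sasdate = '15MAR2013'd;                 /* a SAS date value          */
   put sasdate= e8601da.;                  /* 2013-03-15 (extended form) */
   put sasdate= b8601da.;                  /* 20130315   (basic form)    */
   /* read an ISO 8601 string back into a SAS date */
   isodate = input('2013-03-15', e8601da10.);
   put isodate= date9.;                    /* 15MAR2013 */
run;
```

The B8601 family writes the compact "basic" ISO 8601 form, while E8601 writes the hyphenated "extended" form.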
RUN_MACRO Run! With PROC FCMP and the RUN_MACRO Function from SAS® 9.2, Your SAS® Programs Are All Grown Up
When SAS® first came into our life, it comprised but a DATA step and a few
procedures. Then we trained our fledgling programs using %MACRO and %MEND
statements, and they were able to follow scripted instructions. But with SAS
9.2 and 9.3, your macros are now wearing the clothes of a PROC FCMP function
– they can handle complex tasks on their own. FCMP functions are independent
programming units that – with the help of the special RUN_MACRO function -
offer dynamic new capabilities for our macro routines. This presentation will
walk through an advanced macro and show how it can be rewritten using a PROC
FCMP and RUN_MACRO approach.
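A minimal sketch of the pattern, with hypothetical macro and function names: RUN_MACRO copies the function's argument values into same-named macro variables, runs the macro, and copies the macro variables back into the function's variables.

```sas
%macro square_it;                 /* reads &x, writes &result */
   %let result = %sysevalf(&x * &x);
%mend;

proc fcmp outlib=work.funcs.demo;
   function square(x);
      result = .;                               /* receives the macro's output */
      rc = run_macro('square_it', x, result);   /* x in, result out */
      if rc eq 0 then return(result);
      return(.);
   endsub;
run;

options cmplib=work.funcs;        /* make the function visible */

data _null_;
   y = square(7);
   put y=;
run;
```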
Let SAS® Do the Coding for You
Many times, we need to create the same reports for different groups, each based on the group's subset of the queried data, or we have to write much repetitive SAS code, such as a series of IF-THEN-ELSE statements or a long list of conditions in a WHERE statement. Manually writing and changing these statements is a chore, especially when reporting requirements change frequently. This paper will suggest methods to streamline and eliminate the process of writing and copying/pasting your SAS® code for each requirement change. Two techniques will be reviewed, each driven by a list of key words in a SAS® data set or an Excel® file: 1) Create code using DATA _NULL_ and PUT statements, written to an external SAS code file to be executed with a %INCLUDE statement. 2) Create code using DATA _NULL_ and CALL SYMPUT to write SAS® code to a macro variable. You will be amazed how useful this process is for hundreds of routine reports, especially on a weekly or monthly basis. RoboCoding is not just limited to reports; this technique can be expanded to include other procedures and data steps. Let the RoboCoder do the repetitive SAS® coding work for you!
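Technique 1 might look like the following sketch, assuming a hypothetical control data set WORK.GROUPS (one row per group, character variable GROUPVAL) driving a PROC PRINT per group:

```sas
filename gencode temp;

data _null_;
   set work.groups;                      /* control data: one row per group */
   file gencode;
   length stmt $200;
   put 'proc print data=work.sales;';
   stmt = catt('   where region = "', strip(groupval), '";');
   put stmt;
   put 'run;';
run;

%include gencode / source2;              /* SOURCE2 echoes the generated code */
```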
Format Follows Function: User-Written Formats and User-Written Functions that talk to the SAS Metadata Server
SAS® now has a very simple interface that allows programmers to create user-written functions with PROC FCMP; they can be used as if they had been built into SAS all along. These functions can build upon existing functions in SAS and can also be used in custom user-written formats. This paper shows an example where a SAS dataset contains a metadata URI with a user-written format that calls a user-written SAS function that seamlessly contacts the metadata server behind the scenes and transparently turns a metadata ID into the actual displayed name or description of the object.
Writing Macro Do Loops with Dates from Then to When
Dates are handled as numbers with formats in SAS® software. The SAS macro language is a text-handling language. Macro %do statements require integers for their start and stop values.
This article examines the issues of converting dates into integers for use in macro %do loops. Three macros are provided: a template to modify for reports, a generic calling macro function which contains a macro %do loop and a function which returns a list of dates. Example programs are provided which illustrate unit testing and calculations to produce reports for simple and complex date intervals.
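The core conversion the paper describes -- turning dates into the integers a macro %do loop needs -- can be sketched with INTCK and INTNX (macro and interval names here are illustrative):

```sas
/* Loop over each month between two dates */
%macro monthly_reports(start, end);
   %local i n d;
   %let n = %sysfunc(intck(month, &start, &end));   /* number of steps */
   %do i = 0 %to &n;
      %let d = %sysfunc(intnx(month, &start, &i, b)); /* i-th month start */
      %put Reporting month: %sysfunc(putn(&d, monyy7.));
      /* call the real report macro here, passing &d */
   %end;
%mend;

%monthly_reports('01JAN2013'd, '01APR2013'd);
```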
Database Vocabulary: Is Your Data Set a Dimension (LookUp) Table, a Fact Table or a Report?
This paper provides a review of database vocabulary and design issues. It reviews the categories of variables and tables in a relational database and offers tools to categorize variables in a data set and recode them so that the data set meets all the criteria of a relational database table.
Report Dates: %Sysfunc vs Data _Null_;
It is the purpose of this presentation to illustrate %Sysfunc as an alternative
to the DATA _Null_ technique for creating report dates in macro variables.
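The contrast in brief (format and macro variable names illustrative):

```sas
/* DATA _NULL_ approach */
data _null_;
   call symputx('rptdate', put(today(), worddate.));
run;

/* %SYSFUNC one-liner -- no DATA step needed */
%let rptdate2 = %sysfunc(today(), worddate.);

title "Report run on &rptdate2";
```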
Summarizing Character Variables Using SAS® Proc Report
Priya Suresh and Elizabeth Heath
The SAS® report procedure, Proc Report, is feature rich and is ideal for summarizing numerical values and flagging data outliers. Sometimes the values to be summarized are character variables, such as locations or names. In this paper, we will present a nifty technique to summarize character variables using SAS® Proc Report, show how to flag outlier values, and point out a few idiosyncrasies of Proc Report.
Proc Compare – The Perfect Tool for Data Quality Assurance
In the data warehouse environment, Quality is King. If you want to assure that the data in your warehouse correctly reflects the sources of the data, you have to compare your warehouse data back to the source. SAS has the perfect tool for the job – Proc Compare – and along with a few other tools from your SAS toolbox, you’ll be able to ensure that the data is right (or not!).
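A minimal sketch of the idea, with hypothetical library and key names:

```sas
/* Sort both sides by the key, then compare row-by-row on that key */
proc sort data=src.orders  out=source_s; by order_id; run;
proc sort data=whse.orders out=target_s; by order_id; run;

proc compare base=source_s compare=target_s listall;
   id order_id;              /* match rows by key, not by position */
run;

/* SYSINFO holds a return code: 0 means the data sets are equal */
%put &=sysinfo;
```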
The Power of Combining Data with PROC SQL
Combining two data sets that contain a common identifier with a MERGE statement in a data step is one of the basic fundamentals of data manipulation in SAS. However, one quickly realizes that there are many situations in which a simple merge grows significantly more complicated. Real-world data are usually not straightforward, and analyses often require combining radically different databases. These situations require a more sophisticated type of merging. Using the SQL procedure, instead of the more traditional data step, is a powerful solution for merging data in complicated situations.
This paper will demonstrate how to combine data using PROC SQL from the simplest of merges to the complex.
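One such situation, sketched with hypothetical table names: a left join against a summarized subquery, which keeps every customer even when no transactions match -- the kind of merge that gets awkward in a DATA step.

```sas
proc sql;
   create table cust_totals as
   select c.cust_id,
          c.name,
          coalesce(t.total_spend, 0) as total_spend
   from customers as c
        left join
        (select cust_id, sum(amount) as total_spend
         from transactions
         group by cust_id) as t
        on c.cust_id = t.cust_id;
quit;
```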
Comparisons of SAS Mixed and Fixed Effects Modeling for Observed over Expected Count Outcomes in the Presence of Hierarchical or Clustered Data
Rachel E Patzer and Laura Plantinga
There are numerous SAS modeling approaches that can be used to model an outcome of a standardized ratio measure (observed/expected counts), such as the frequently encountered Standardized Mortality Ratio (SMR). The purpose of this paper is to examine facility-level predictors associated with a standardized ratio measure -- the Standardized Transplant Ratio (STR) -- by comparing mixed- and fixed-effects modeling approaches in an analysis of dialysis facilities nested within 18 geographical regions of the US.
In a cross-sectional, multi-level ecologic study using the publicly available Dialysis Facility Report (2007-2010) data, we examined 3,983 dialysis facilities across the US. STRs were defined as the number of observed kidney transplants within a dialysis facility divided by the number of expected transplants. We considered the outcome as both linear (STR and log-transformed STR), and count (observed counts as outcome with expected counts or person-years as offsets). We utilized random effects and generalized estimating equation modeling to account for correlation of facilities within regions. We considered SAS proc mixed to examine fixed and random effects and proc glimmix to further examine random effects with the linear outcomes STR and log-STR. We used SAS proc genmod (fixed effects) and proc glimmix (mixed effects) to examine count outcomes, using a log link and the negative binomial distribution to account for overdispersion.
While SAS proc glimmix models did not converge, overall, the other various modeling strategies in SAS gave similar answers about the magnitude and significance of facility-level predictors. Linear mixed effects models allow for random effects at the network level, but the model assumes normality of the outcome and residual errors (which are violated), and interpretation of log-transformed STR is not intuitive. SAS proc genmod with a negative binomial distribution using transplant counts as the outcome and person-years as the offset does not allow for random effects, but has the advantage of expected counts not previously being modeled. Modeling results using the count outcome and expected counts as the offset were similar to those using observed counts and person-years only and exponentiated beta estimates are easily interpretable as change in STR associated with unit change in predictor.
Selecting Earliest Occurrence: Watch Your Step
In processing statistical data, it is often necessary to select the earliest occurrence of some event, e.g., the earliest diagnosis or treatment date, or perhaps the earliest purchase or payment. This seems straightforward enough logically; however, it is all too easy to implement it not quite right when using the SAS data step. That would not be so bad if such mistakes generated errors or warnings, but for several important errors nothing is issued to alert the analyst. Rather, the result looks pretty good but is off by a little, and that, for the conscientious analyst, is a little too much. We examine five successive attempts at a solution for this apparently simple problem and show the error in the first four.
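The paper's own five attempts are not reproduced here; as a general illustration of the problem, one safe approach is to exclude missing dates (which sort before all real dates in SAS) before taking the first record per BY group:

```sas
/* Hypothetical names: earliest non-missing diagnosis date per patient */
proc sort data=visits(where=(not missing(diag_date))) out=visits_s;
   by patient_id diag_date;
run;

data earliest;
   set visits_s;
   by patient_id;
   if first.patient_id;      /* first row in sorted order = earliest date */
run;
```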
Handling data with multiple records per subject: 4 quick methods to pull only the records you want
Often we are interested in only the latest record for a subject, or the record with the subject's highest value of a certain field, but the data are stored with multiple records per subject. This paper covers 4 techniques to generate a data set with unique records for each subject, based on specific criteria, and what to do when more than one record meets the criteria. The techniques covered use PROC SQL, PROC SORT, PROC RANK, and a DATA step with by-group processing.
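Two of these techniques, sketched with hypothetical names (latest visit per subject):

```sas
/* 1. PROC SORT: sort descending on date, then NODUPKEY keeps the first */
proc sort data=visits out=latest1;
   by subject_id descending visit_date;
run;
proc sort data=latest1 nodupkey;
   by subject_id;            /* first row per subject = latest visit */
run;

/* 2. PROC SQL: summary remerge via HAVING (ties keep all matching rows) */
proc sql;
   create table latest2 as
   select *
   from visits
   group by subject_id
   having visit_date = max(visit_date);
quit;
```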
Fit Discrete Distributions via SAS Macro
Discrete distributions such as the Poisson, binomial, negative binomial, and beta-binomial are important in modeling event data: the number of losses or claims in finance and insurance, disease frequencies, the number of phone calls occurring in a given period of time, and so on. SAS/STAT provides procedures such as PROC GENMOD to facilitate the fitting, but it lacks a graphical comparison of the observed density against the expected one, and users must run separate analyses for different types of distributions. We develop SAS macros that perform sampling on large data sets, distribution fitting, and goodness-of-fit testing (such as chi-square tests) all together, and provide a final comparison table for decision making. We also put different types of fitted distributions, such as Poisson and negative binomial, along with the original data in one chart for visual comparison. With the assistance of the SAS macros, users can fit different types of discrete distributions on sampled data and see the combined results, including both fit statistics and graphics, with a few easy-to-understand options in one step.
The BEST. Message in the SASLOG
Your SAS® routine has completed. It is apparently a success – no bad return codes, no ERROR or WARNING messages in the SASLOG, and a nice thick report filled with what appear to be valid results. Then, you notice something at the end of the SASLOG that warns you that, somewhere in your output, one or more numbers are NOT being printed with the formats you requested: NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format.
This presentation will review an ad hoc routine that can be used to quickly identify numbers that are too large for your selected format – certainly much quicker and more effective than attempting to eyeball a massive report!
Array Applications in Determining Periodontal Disease Measurement
Liang Wei, Laurie Barker and Paul Eke
A SAS array can be defined as a series of related variables under a single name. It can be created with the ARRAY statement within a data step and provides a way to process variables repetitively using a do-loop. SAS arrays can be used for creating or modifying a group of variables, reshaping a data set, or comparing across observations. When SAS arrays and do-loops are properly used, they become powerful data manipulation tools that make SAS code more concise and more efficient. This paper introduces one-dimensional and two-dimensional array applications for manipulation of repeated measures data, using periodontal disease data for illustration. The coding approach can be particularly helpful for other researchers and data analysts working in oral health and other medical fields, and for SAS programmers working on similar repetitive tasks.
Automating Data Vetting Using SAS Macros
The first step in building a predictive model is to understand the quality of the available data. When data sets are the size of those typically used in an educational setting, this can be done manually. But in today's world of big data, data sets with many hundreds of fields are common, and a field-by-field invocation of PROC FREQ is infeasible. We present several SAS macros that automate the vetting process. Extraction of metadata is illustrated through the use of %SYSFUNC, while statistics such as max, min, mean, median, and the number and percent missing are automatically generated for each numeric field. Reports for character fields note the longest, shortest, and most frequently occurring values. Recommendations on how to use a field in modeling are also output, letting the modeler know whether there are an excessive number of missing values or whether the field is probably a serial number or key field.
Using Arrays to Handle Transposed Data
Oftentimes, particularly in the health care field, programmers and analysts need to deal with multiple repeated observations (or patients), with the task of searching for a desired string, number, or other trait within a certain variable. PROC TRANSPOSE can be utilized to transpose the data so that each record has a unique observation, with the transposed variable displayed across multiple columns. Then an array is processed across the multiple columns, using FIND, SCAN, SUBSTR, or another character function, and returns a new variable with a value of 1 if the transposed variables contain the desired information. Additional code must be written to return a value of 0 if they do not. Other techniques and tools are also discussed to enhance the efficiency of this trick. All techniques are intended for the working knowledge of the Base SAS programmer, with the hope that the provided simple, straightforward array code will make the task of seeking characteristics of observations much easier.
How SAS Processes If-Then-Else Statements
Michael Leitson and Elizabeth Leslie
If-Then-Else statements are relatively simple and convenient to use in SAS. However, one must be careful to write correct SAS syntax in order to achieve the desired data step processing. A single misplaced or forgotten statement can lead to erroneous results. This paper examines the proper way to write if-then-else statements and also seeks to help you avoid producing incorrect data sets. The intended audience is beginning SAS users who are learning about the joys and pitfalls of if-then-else statements.
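The central pitfall in brief: without ELSE, every IF is tested for every row, and a later branch can overwrite an earlier assignment. A proper chain lets each row satisfy exactly one branch:

```sas
/* Hypothetical grading example */
data grades;
   set scores;
   length grade $1;
   if score >= 90 then grade = 'A';
   else if score >= 80 then grade = 'B';   /* only tested when score < 90 */
   else grade = 'C';
run;
```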
Don't Get the Finger... You Know the FAT Finger Creating a Modular Report Approach using BASE SAS
As SAS report developers we often use the same bits and pieces of code for different reports and analyses. We take a morsel of code from this program and combine it with a scrap of code from that program to create a new analysis. While retreading code is something we all do, surprisingly it can be time consuming and can often lead to the dreaded phenomenon of fat fingering. By using a combination of SAS Base procedures and features such as PROC Report, ODS, and SAS Macros, you can create a modular reporting approach from which you can choose bits and pieces of your most useful code lines, yet minimize the amount of necessary tweaking. By developing this modular approach, you can save time, standardize your report output, minimize coding errors and avoid getting the fat finger.
PROC FORMAT in DATA Step mathematics
PROC FORMAT and data step mathematics can be used to bypass computational limitations to calculate probability estimates of exceedingly rare events. A client needed to assess the likelihood of finding a defect, given that one hadn't yet been found in thousands of tests. Standard binomial tables extend only to 500 trials. The formula cannot be calculated directly and even the numeric approximation was intractable given the available hardware. The numbers even exceeded the capacity of SAS's combination and factorial functions. A review of publications extending back through 1964, the application of mathematical methods to simplify calculations, and a custom-written PROC FORMAT and SAS data step led to an answer for the client -- vanishingly small.
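The paper's own derivation is not reproduced here, but the general trick of working on the log scale when COMB and FACT overflow can be sketched with LGAMMA (the trial count and defect rate below are illustrative):

```sas
/* Binomial probability of k defects in n trials, computed via log-gamma */
data _null_;
   n = 10000;          /* trials observed                 */
   p = 1e-5;           /* hypothesized defect rate        */
   k = 0;              /* defects found                   */
   /* log of C(n,k) * p**k * (1-p)**(n-k) */
   logprob = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
             + k*log(p) + (n-k)*log(1-p);
   prob = exp(logprob);
   put prob= best16.;
run;
```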
Using Heatmaps and Trend Charts to Visualize Kidney Recipients’ Post-Transplant Hospitalizations
In trying to understand the hospital resource utilization of kidney transplant patients after surgery, it is important to know how often patients have to be readmitted to the hospital, how long they spend in the hospital and what factors influence their readmission rates. As a component of exploratory analysis, data visualization was essential to elucidate patterns and relationships between study variables. This study employed heatmaps and trend charts summarizing the percent of patients hospitalized to understand hospitalization rates over time.
Heatmaps were developed to show every hospitalization of every patient in our data set over a 3-year period. Patients were listed on the y axis and the x axis showed time from surgery. Times that a patient spent in the hospital were shown as horizontal bars, so you could see how many days post-op the patient re-entered the hospital and, by the length of the bar, the length of that particular admission. The bars were color-coded by readmission number. These heatmaps were useful for getting a general sense of the pattern of hospitalizations for the whole population and for investigating the hospitalizations of individual patients. It was easy to see, for example, that patients with early readmissions tended to have more readmissions in their future.
Heatmaps, however, were not good at differentiating patterns of hospitalizations between groups of patients. For this reason, we used summary data of hospitalizations. For each 30-day period post-op, the percent of patients who were hospitalized during that time span was calculated within each subgroup of interest. This percent was then shown on a line graph with time on the x-axis. By graphing two or more groups of patients on the same graph, the trends and differences in patterns were observed. Non-diabetic patients, for example, were shown to have slightly lower hospitalization rates than diabetic patients over the whole 3-year period.
This study showed heatmaps and trend charts can each be useful in visualizing hospitalization patterns in different ways. Using both of these tools we can continue to explore what factors affect hospitalization rates in our local kidney transplant population.
Resistance is Futile: SAS Enterprise Guide Assimilates PC SAS
You’ve always used PC SAS (aka Base SAS) to do your SAS development and are quite happy with it. But now your manager is asking you to work on some SAS EBI (Enterprise Business Intelligence) projects, creating stored processes. Or maybe your company has decided to eliminate PC SAS and change everyone over to SAS Enterprise Guide. Or, you’re just not sure what it means when you hear ‘the grid’ talked about in SAS Enterprise Guide. We’ll compare PC SAS 9.1, SAS Enterprise Guide 4.1 (v9.1) and SAS Enterprise Guide 5.1 (v9.2 and v9.3), and help you make a smooth transition.
The Short-Order Batch
A project that documents pedestrian and bicycle crash attributes and locations requires manual data entry. The image of each crash report form, contained in a TIFF file, must be examined to categorize the crash attributes, and to geolocate the crash. We equip the person performing this task with a Google Earth KML file showing the approximate location, and an Excel worksheet with other pertinent information. Each year, there are approximately 3500 of these crashes, requiring an unknown number of coders. In the past, a single, large KML file and Excel worksheet were created for the coder. When more coders were hired, the existing files had to be split and a fortune-teller predicted what amount of data to allocate to each coder. Further divisions were always required, and re-assembly was ugly.
The SAS code we examine makes little molehills out of a mountain of raw data. When the coder is ready, a small batch of information appears. The code discovers filenames from the operating system, creates a batch folder, places an Excel worksheet and a SAS program in the batch folder. The user then customizes and runs the new SAS program to create the KML file.
How Many Licks to the Center of that Column?
Sometimes centering statistics just does not quite present data the way it should. Scanning down a column of means, ranges, p-values, et cetera takes more effort than it should due to alignment issues. What if all these stats could be aligned more readably? This article aims to answer this question by presenting a dynamic algorithm to align statistics along a vertical column, on a common integer or punctuation mark across any given statistic. As input, the algorithm needs the column width in order to properly indent the data. Additionally, to display each column alike, it needs the number of columns. The techniques presented herein offer a good overview of basic data step and array programming appropriate for all levels of SAS capability. While this article targets a clinical computing audience, the techniques apply to a broad range of computing scenarios.
Using SAS to calculate Modularity of a Graph for Community Detection Problems
Modularity is a measure proposed by Girvan and Newman in 2004 to quantify the structure of a network or graph. A network with higher modularity has stronger connections within modules or communities and weaker connections between modules. The measure has recently become popular in graph theory and has been explored as a quality function for partitioning a graph into multiple communities by maximizing its value. While there are many software implementations that calculate the modularity of a network, no procedure exists in SAS software to calculate this measure.
This paper is one of the first attempts to calculate this measure with Base SAS software. To do so, we first define a network structure and then use a classic network data set to test the macro, comparing our results with those of other software. We also partition a small network with the macro to test its use for community detection with Base SAS. The method is limited to small networks; further discussion will follow on using SAS/IML to handle larger networks.
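For reference, the quantity in question is the Newman-Girvan modularity of a partition of the graph into communities:

```latex
Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)
```

where A is the adjacency matrix, k_i is the degree of node i, m is the total number of edges, c_i is the community assigned to node i, and the delta function is 1 when nodes i and j are in the same community and 0 otherwise.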
Explore RFM approaches using SAS
RFM (Recency, Frequency, and Monetary value) has long been one of the most widely used methods in direct marketing and database marketing (Bauer et al., 2002). In spite of the more sophisticated statistical models developed in data mining recently, marketers continue to deploy RFM models because they let decision makers effectively profile customers and identify valuable ones. Newer data mining techniques like decision trees and logistic regression have been incorporated into RFM models to make meaningful segments and improve predictability (McCarty and Hastak, 2007). Unfortunately, users have very limited access to RFM modeling in SAS. SAS Enterprise Guide has an RFM module, but it provides only equal-bucket binning and quantile binning to rank recency, frequency, and monetary value, and it does not provide the flexibility to use different binning methods among the three measures. In this paper, we provide SAS macro code that implements a fully functional RFM analysis, including not only bucket and quantile binning but also logistic regression, CHAID, and other methods. The macro we developed also extends RFM analysis and applies the RFM score in general predictive modeling.
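The quantile-binning baseline that the macros extend can be sketched with PROC RANK (variable names illustrative):

```sas
/* Quintile (1-5) R, F, M scores */
proc rank data=customers out=rfm groups=5;
   var recency frequency monetary;
   ranks r_rank f_rank m_rank;
run;

data rfm_scored;
   set rfm;
   /* PROC RANK groups run 0-4; low recency is good, so flip it */
   r_score  = 5 - r_rank;
   f_score  = f_rank + 1;
   m_score  = m_rank + 1;
   rfm_cell = catt(r_score, f_score, m_score);   /* e.g. "555" = best */
run;
```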
Hands On Workshops
SAS® Enterprise Guide® 5.1: A Powerful Environment for Programmers, Too!
Have you been programming in SAS® for a while and just aren't sure how SAS® Enterprise Guide® can help you?
This presentation demonstrates how SAS programmers can use SAS Enterprise Guide 5.1 as their primary interface to SAS, while maintaining the flexibility of writing their own customized code.
- navigating and customizing the SAS Enterprise Guide environment
- using SAS Enterprise Guide to access existing programs and enhance processing
- exploiting the enhanced development environment including syntax completion and built-in function help
- using SAS® Code Analyzer, Report Builder, and Document Builder
- adding Project Parameters to generalize the usability of programs and processes
- leveraging built-in capabilities available in SAS Enterprise Guide to further enhance the information you deliver.
Know Thy Data: Techniques for Data Exploration
Andrew Kuligowski and Charu Shankar
Get to know the #1 rule for data specialists: Know thy data. Is it clean? What are the keys? Is it indexed? What about missing data, outliers, and so on? Failure to understand these aspects of your data will result in a flawed report, forecast, or model.
In this hands-on workshop, you learn multiple ways of looking at data and its characteristics. You learn to leverage PROC MEANS and PROC FREQ to explore your data, and how to use PROC CONTENTS and PROC DATASETS to explore attributes and determine whether indexing is a good idea. And you learn to employ PROC SQL’s powerful dictionary tables to explore and even change aspects of your data.
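For example, a dictionary-table query of the kind covered in the workshop:

```sas
/* What data sets live in a library, and how big are they? */
proc sql;
   select memname, nobs, nvar
   from dictionary.tables
   where libname = 'SASHELP'       /* dictionary values are uppercase */
     and memtype = 'DATA';
quit;
```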
A Row is a Row is a Row, or is it? A Hands-on Guide to Transposing Data
Sometimes life would be easier for the busy SAS programmer if information stored across multiple rows were all accessible in one observation, using additional columns to hold that data. Sometimes it makes more sense to turn a short, wide data set into a long, skinny one -- convert columns into rows. Base SAS® provides two primary methods for converting rows into columns or vice versa – PROC TRANSPOSE and the DATA step. How do these methods work? Which is best suited to different transposition problems? The purpose of this workshop is to demonstrate various types of transpositions using the DATA step and to “unpack” the TRANSPOSE procedure. Afterwards, you should be the office go-to gal/guy for reshaping data.
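A representative long-to-wide transposition, with hypothetical names:

```sas
/* One row per subject per visit  ->  one row per subject */
proc sort data=long; by subject_id; run;

proc transpose data=long out=wide prefix=visit_;
   by subject_id;
   id visit_num;        /* supplies the new column suffixes */
   var score;
run;
/* WIDE now has one row per subject: visit_1, visit_2, ... */
```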
How To DOW
The DOW-loop is a nested, repetitive DATA step structure enabling you to isolate instructions related to a certain break event before, after, and during a DO loop cycle in a naturally logical manner. Readily recognizable in its most ubiquitous form by the DO UNTIL(LAST.ID) construct, which readily lends itself to control break processing of BY group data, the DOW loop's nature is more morphologically diverse and generic. In this workshop, the DOW-loop's logic is examined via the power of example to reveal its aesthetic beauty and pragmatic utility. In some industries like Pharma, where flagging BY group observations based on in-group conditions is standard fare, the DOW-loop is an ideal vehicle, greatly simplifying the alignment of business logic and SAS code. In this hands-on workshop, the attendees will have an opportunity to investigate the program control of the DOW-loop step by step using the SAS DATA step debugger.
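The canonical form, for reference -- one output row per BY group, with no RETAIN statement needed because the accumulator lives inside a single implicit-loop iteration:

```sas
data totals;
   do until (last.id);
      set have;                     /* detail rows for one ID */
      by id;
      total = sum(total, amount);   /* accumulate within the group */
   end;
   output;                          /* exactly one row per ID */
run;
```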
How to Use ARRAYs and DO Loops: Do I DO OVER or Do I DO i?
Do you tend to copy DATA step code over and over and change the variable name? Do you want to learn how to take those hundreds of lines of code that do the same operation and reduce them to something more efficient? Then come learn about ARRAY statements and DO loops, powerful and efficient data manipulation tools. This workshop covers when ARRAY statements and DO loops can and should be used, how to set up an ARRAY statement with and without specifying the number of array elements, and what type of DO loop is most appropriate to use within the constraints of the task you want to perform. Additionally, you will learn how to restructure your data set using ARRAY statements and DO loops.
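A minimal sketch of the pattern (variable names illustrative):

```sas
/* Recode a family of variables in one pass instead of
   copy-pasting the same IF statement ten times */
data recoded;
   set survey;
   array q{10} q1-q10;              /* ten questionnaire items */
   do i = 1 to dim(q);
      if q{i} = 999 then q{i} = .;  /* 999 means "not answered" */
   end;
   drop i;
run;
```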
Introduction to Interactive Drill Down Reports on the Web
Michael Sadof and Louis Semidey
Presenting data on the web can be an intimidating project for those who are not familiar with the technology. Luckily, SAS provides users with a method of presenting dynamic reports on a web server utilizing basic SAS syntax and even legacy reports. This workshop will walk you through several methods of utilizing Proc Report and will teach you some HTML code to present interactive reports on your web server. This technique requires that you have several BI components running on your server but does not require advanced knowledge of SAS BI or HTML. In this workshop, we will start with the basics and develop a roadmap for producing dynamic reports with Stored Processes without using OLAP cubes or the OLAP Server.
The Armchair Quarterback: Writing SAS® Code for the Perfect Pivot (Table, That Is)
“Can I have that in Excel?” This is a request that makes many of us shudder. Now your boss has discovered Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data are refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.
Extend the Power of SAS® to Use Callable VBS and VBA Code Files Stored in External Libraries to Control Excel Formatting Routines
Did you ever wish you could use the power of SAS® to take control of EXCEL and make EXCEL do what you wanted WHEN YOU WANTED? Well, one letter is the key to doing just that: the letter X, as in the SAS X command that opens the door to all operating system commands from SAS. The Windows operating system comes with a facility to write a series of commands called scripts. These scripts have the ability to open and reach into the internals of EXCEL. Scripts can load, execute, and remove VBA macro code and control EXCEL. This level of control allows you to make EXCEL do what you want, without leaving any traces of a macro behind. This is Power.
Give the Power of SAS® to Excel Users Without Making Them Write SAS Code
Merging the ability to use SAS and Excel can be challenging. However, with the advent of SAS Enterprise Guide, SAS Integration Technologies, SAS BI Server software, SAS JMP software, and SAS Add-ins for Microsoft Office products, this process is less cumbersome. Using Excel has the advantages of being cheap, available, easy to learn, and flexible. On the surface, SAS and Excel seem widely separated without these additional SAS products. But wait, BOTH SAS AND EXCEL CAN INTERFACE WITH THE OPERATING SYSTEM. SAS can run Excel using the X command, and Excel can run SAS as an APPLICATION. This is NOT DDE; each system works independently of the other. This paper gives an example of Excel controlling a SAS process and returning data to Excel.
When Little Objective Data Are Available, Find Root Causes and Effects with Interrelationship Digraphs and JMP®
The Interrelationship Digraph (ID) is one of seven new Quality Control tools described by Shigeru Mizuno (Management for Quality Improvement: The New 7 QC Tools, Cambridge, MA, Productivity Press, Inc., 1988).
IDs show cause-and-effect relationships between several items, ideas, factors, or issues. IDs are useful in exploring relationships between ideas; prioritizing choices when decision makers find it difficult to reach consensus; and sorting out issues involved in project planning, especially when credible data may not exist.
IDs provide a means of evaluating the ways in which disparate ideas influence one another. IDs make it easy to spot leading factors that affect other factors by blending the cause-and-effect thinking of Ishikawa diagrams with the creative logic of brainstorming, responding to frequent complaints about cause-and-effect diagrams: “What are the most important causes among the many choices available?” and “How do they interact or connect with each other?”
This presentation will show ways JMP can construct IDs to identify and process ideas that drive process improvement efforts. JSL scripts will create customized reports combining traditional graphical and matrix ID formats.
From Raw Data to Beautiful Graph using JSL
JSL is a powerful tool for manipulating raw data into the form needed for easy visualization in JMP. This talk walks through a working script that transforms raw iOS sales data into an easy-to-interpret graph. Along the way, we learn how to summarize data, add formulas, add complex column properties, and add table scripts all with JSL.
I'm a SAS Programmer. Why should I JMP?
JMP® software, when used as a stand-alone package, provides a variety of ways of understanding, visualizing and communicating what your data is telling you. Adding JMP software and functionality to a SAS programming environment can result in the best of both worlds. Topics covered include previewing SAS data, running SAS procedures from JMP, using JMP for further exploration of SAS results, and using SAS geographic data with JMP. Through these features and others in JMP's point-and-click environment, JMP extends the power and functionality of SAS.
Using JMP® Partition to Grow Decision Trees in Base SAS®
The decision tree is a popular technique used in data mining and is often used to pare down to a subset of variables for more complex modeling efforts. If your organization has licensed only Base SAS and SAS/STAT, you may be surprised to find that there is no procedure for decision trees. However, if you are a licensed JMP user, you can build and test a decision tree with JMP. The Modeling→Partition analysis provides an option for creating SAS DATA step scoring code. Once created, the scoring code can be run in Base SAS. This discussion will provide a brief overview of decision trees and illustrate how to create a decision tree with Partition in JMP and then create the SAS DATA step scoring code.
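The exported scoring code is ordinary DATA step logic. A hypothetical fragment of what the Partition platform might generate, runnable in Base SAS (data set, variables, and split values are all illustrative, not actual JMP output):

```sas
/* Hypothetical JMP Partition scoring code: a cascade of IF/THEN
   splits that assigns a predicted probability to each new record */
data work.scored;
   set work.newcases;           /* data set to be scored (hypothetical) */
   if missing(age) then prob_event = 0.31;
   else if age < 45 then do;
      if income < 60000 then prob_event = 0.18;
      else prob_event = 0.52;
   end;
   else prob_event = 0.64;
run;
```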
Planning and Administration
Survey of Big Data Solutions using SAS(r) Technologies
In the context of SAS(r) technologies, the term "big data" means one of two things: handling large volumes of data and performing statistical analysis in near real time, or distributing a large processing problem over multiple servers. SAS(r) offers a grid solution for the latter and two solutions for the former: in-database and in-memory. This paper offers a brief survey of each solution and the sorts of business problems each can solve.
Rebuilding SAS Web Application for Web Report Studio 4.3
SAS® Enterprise Business Intelligence Web Applications can be customized based on an organization's needs. In order to customize Web Application settings and configuration, EAR files need to be rebuilt and redeployed for seamless access to SAS® Web applications. This paper illustrates and focuses on the steps required to install the SAS® 9.3 Application Server and the Web Application Server (JBoss 4.2.3 GA) in a Windows Server 2008 environment, and discusses scenarios for rebuilding EAR files for SAS® Web Report Studio 4.3.
The Many Hats of a SAS Administrator: An Insider’s Guide on Becoming an Indispensable Asset in Your Organization
With the increased focus on analytics across all sectors of the economy, SAS systems, and by extension capable SAS Administrators, are in high demand. Because there is a dramatic shortage of supply to meet this ever-increasing demand, getting flooded with interview requests from recruiters is as easy as posting your resume with the keyword 'SAS' to LinkedIn or online job boards. Becoming an indispensable asset to the organization that hires you as a SAS Administrator is not as easy. It's going to require you to be adept at diagnosing and solving complex problems, enthusiastic when faced with a constant stream of new technology, responsive to management initiatives and policies, well versed in the business value of SAS, proficient in technical writing, and genuinely empathetic to the needs and concerns of your internal and external SAS system users. In this paper, I outline a step-by-step approach to increasing your visibility within your organization as the go-to SAS guru.
The Hitchhiker's Guide to Github: SAS Programming Go Social
Don't Panic! GitHub is a fantastic way to host, share, and collaborate on your code (SAS included, of course). You will like it: first, it’s totally free (for all public repositories); and second, it has an easy-to-use GUI client, GitHub for Windows, if you work in Windows and don’t feel comfortable writing the Git commands on which GitHub is based. No additional installation or configuration is needed to use GitHub.
Aware of it or not, there are a bunch of SAS programmers already playing with GitHub, but in general it’s still new to the SAS community, especially compared with the R community. This paper takes a step-by-step approach to introducing GitHub and its fabulous GUI client to submit, revise, sync, and socialize your code. All supporting materials are also available on the author’s GitHub page.
Your Analytics project is going to fail... Ask me why.
Adam Hood and Martin Young
SAS projects of any size regularly encounter unexpected issues. These issues typically evolve from a misunderstanding of the administrative side of the SAS platform. This includes platform performance, data access, and data storage. Organizations and business users get frustrated with the SAS platform when they experience performance problems and don't understand the underlying architectural reasons. This has a negative impact on their work, their projects and their perception of the platform.
As an example, we recently completed a large text mining project for a Fortune 100 client. The scope and complexity of the project were going to strain the architecture, creating a risk to its success. To mitigate this, we implemented new project processes aimed at educating business users on the platform and on the specific constraints that would affect the success of the project - for example, the impact of coding choices for text mining. This process was further refined into a set of questions to ask for every project going forward. These questions are intended to be answered by the project group, including sponsors, business users, analysts, SAS administrator(s), and data subject matter experts.
The questions center on these four topics:
- Data access
- Data size
- Data requirements (and the difference between wish list and strict requirements)
- Data organization and management
Exploring these four topics helps the project team discover and mitigate architecture problems early in the analytics project. As a result:
- Timelines become more realistic
- Risks are better known and tracked
- Issues are addressed up front - instead of in the final push to production
- Most importantly, the business is focused on analytics results and the associated ROI
For this pilot project, the team delivered a fairly complex analytical project on time, and all parties walked away successful - and with much less stress. We believe this will serve as an example to future project teams of how to engage with SAS administrators to facilitate project success.
A Practical Approach to Process Improvement Using Parallel Processing
In applications that process huge volumes of data for analysis purposes, it is often essential to minimize processing time to increase efficiency. Large data volumes (in the author’s experience, row counts of 1.5 million and up to 2,500 columns) result in very long processing times of about 4 to 5 days, and it is often necessary to reduce the execution time of such processes. This paper discusses the following generic steps for improving any process. (1) Identify areas for improvement (understand the process thoroughly; analyze logs to identify the steps which take the longest time). (2) Look for processes that can be executed in parallel (actual parallelization of independent processes executed one after another; automated virtual multithreading; executing multiple instances of jobs in UNIX). (3) Adopt alternative methods for performing tasks faster (sorting data sets using threads; use of the SET statement with the KEY= option; use of SYNCSORT® for merging in SAS; intelligent use of indexes for merging; use of user-defined formats). (4) Modularize complex steps and create macros for performing repeated tasks. (5) Reduce code redundancy (removing unwanted code; use of functions and macros for performing repeated tasks). (6) Apply syntactic optimization and delete large intermediate data sets. These techniques optimize the process to allow faster data delivery by reducing execution time.
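As a minimal sketch of step (2), independent jobs can be launched in parallel from a parent SAS session with SYSTASK (the job paths here are hypothetical):

```sas
/* Launch two independent extract jobs at the same time, then
   wait for both to finish before any dependent step runs */
systask command "sas -sysin /jobs/extract_claims.sas"  taskname=t1;
systask command "sas -sysin /jobs/extract_members.sas" taskname=t2;
waitfor _all_ t1 t2;
```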
Parallel processing techniques for performance improvement for SAS processes: Part II
This paper is an extension of an earlier paper titled “A practical approach to process improvement using parallel processing” by the same author. It explores newer, faster, practical, and applicable parallel processing techniques supported by SAS 9.2 and later versions that can be used for processing large volumes of data in parallel in both AIX UNIX and Windows SAS environments. It further dwells on remediating some limitations observed earlier, such as: a. identifying efficient ways to implement parallel processing; b. identifying the number of threads to process in parallel; c. using newer SAS 9.2 procedures (such as SCAPROC) for multithreading SAS; d. analyzing advanced SAS 9 support for SMP computers having multiple CPUs and operating environments that can spawn and manage multiple threads. The paper also covers techniques for analyzing and monitoring process performance. It discusses a practical case study with detailed examples comparing the execution times of the established benchmarks vs. the times after implementing parallel processing, resulting in a 77% improvement from over 96 hours of real time to about 22 hours of real time.
I heart SAS Users
In my 20-some years of experience, I have found SAS users to be smart, helpful, and highly motivated. SAS users are just great. I would like to share some of the experiences I have had in supporting SAS users.
I am very humbled whenever I find a SAS user snagged in a simple error and am able to assist in resolving it. It is an honor and a privilege to work with such magnificent people as the much-hearted SAS users.
The Disk Detective: A Tool Set for Windows SAS® Administrators
Ever needed to know detailed information about a file on your disk subsystem, such as who owns the file, when it was modified, or how much space it takes? Ever wonder how much free space is on your server before you run a huge SQL query? Luckily, with SAS® we can directly access functions within the Windows API to accomplish these and many other tasks by using the SASCBTBL attribute table and the MODULE family of call routines and functions.
This paper will demonstrate how to set up the SASCBTBL attribute table to be able to call all the Windows functions needed to get a complete picture of the file structure on a disk drive.
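A sketch of the pattern (the argument formats shown may need adjusting for a given Windows release and SAS version): write the function description to the SASCBTBL fileref, then call the Windows API function with MODULEN from a DATA step:

```sas
/* Describe a kernel32 function in the SASCBTBL attribute table */
filename sascbtbl temp;
data _null_;
   file sascbtbl;
   put 'routine GetDiskFreeSpaceExA minarg=4 maxarg=4 stackpop=called module=kernel32 returns=long;';
   put 'arg 1 char input  format=$cstr200.;';  /* directory name        */
   put 'arg 2 num  update format=pib8.;';      /* bytes free to caller  */
   put 'arg 3 num  update format=pib8.;';      /* total bytes           */
   put 'arg 4 num  update format=pib8.;';      /* total free bytes      */
run;

/* Call the function and write the result to the log */
data _null_;
   free = 0; total = 0; totfree = 0;
   rc = modulen('GetDiskFreeSpaceExA', 'C:\', free, total, totfree);
   if rc then put 'Bytes free to caller: ' free comma20.;
run;
```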
A Hitchhiker's guide for performance assessment & benchmarking SAS® applications
Almost every IT department today needs some kind of IT infrastructure to support the different business processes in the organization. For a typical IT organization, the ensemble of hardware, software, and networking facilities may constitute the IT infrastructure. IT infrastructure is set up in order to develop, test, deliver, monitor, control, or support IT services. Sometimes multiple applications may be hosted on a common server platform. With a continual increase in the user base, an ever-increasing volume of data, and a perpetual increase in the number of processes required to support the growing business need, there often arises a need to upgrade the IT infrastructure.
This paper discusses a stepwise approach to conducting the performance assessment and benchmarking exercise needed to assess the current state of the IT infrastructure (hardware and software) prior to an upgrade. It follows these steps in order to proceed with a planned approach to process improvement.
- Phase I: Assessment and requirements gathering
  - Understand the as-is process
  - Assess the AIX UNIX server configuration
- Phase II: Performance assessment and benchmarking
  - Server performance: server utilization, memory utilization, disk utilization, network traffic, resource utilization
  - Process performance: CPU usage, memory usage, disk space
- Phase III: Interpretation of results for performance improvement
Increasing College Tuition and Its Impacts on Student Loans
Harjanto Djunaidi and Monica Djunaidi
Many popular articles on college cost and student loans were published at the beginning of 2013. Many questions have been asked about why college tuition has kept increasing over the years. Most of the articles were written on the assumption that the next financial crisis could be triggered by a student loan bubble, which has surpassed credit card loans. The business sector in the US is nervous, as potential default on student loans is imminent. Slower-than-expected US employment growth has added to the anxiety of the market. Knowing what factors have caused the loans to increase is the research question that this study tries to answer. Multivariate statistical analyses, such as factor analysis, applied to government-published data collected through IPEDS and made available through the NCES website, will be used to find potential answers.
SAS Enterprise Business Intelligence Deployment Projects in the Federal Sector
Systems engineering life cycles (SELC) in the federal sector embody a high level of complexity due to legislative mandates, agency policies, and contract specifications layered over industry best practices, all of which must be taken into consideration when designing and deploying a system release. Additional complexity stems from the unique nature of ad hoc predictive analytic systems, which are at odds with the traditional, unidirectional federal production software deployments to which many federal sector project managers have grown accustomed. This paper offers a high-level roadmap for successful SAS EBI design and deployment projects within the federal sector. It is addressed primarily to project managers and SAS administrators engaged in the SELC process for a SAS EBI system release.
Pharma & Healthcare
Coding For the Long Haul With Managed Metadata and Process Parameters
How robust is your SAS code? Put another way, as you look through your program, how sensitive is it to changing circumstances? How much code is affected when, for example, the names of the data sets to be analyzed or the names of variables within those data sets change? How are those changes expected to be implemented? In this paper we discuss program optimization and parameter management through the use of metadata. In the wide-open, free-text environment of Base SAS, we too often worry more about getting results out the door than about producing code that will stand the test of time. We’ll learn in this paper how to identify process parameters and discuss programming alternatives that allow us to manage them without having to touch core code. We’ll look at SAS metadata tools such as the SQL DICTIONARY tables and PROC CONTENTS, as well as tools for reading and processing metadata, such as CALL EXECUTE. Finally, within our industry, we’ll take a brief look at how the Clinical Standards Toolkit puts these methods into practice for CDISC compliance checking. This paper is intended for intermediate-level Base SAS users.
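As a small illustration of the metadata-driven style discussed here (the library and naming pattern are hypothetical), the DICTIONARY tables can supply the data set names and CALL EXECUTE can generate the code:

```sas
/* Pull the names of the data sets to process from metadata ... */
proc sql;
   create table work.targets as
   select libname, memname
   from dictionary.tables
   where libname = 'ADAM' and memname like 'AD%';
quit;

/* ... then generate one PROC MEANS step per data set, so adding a
   new data set requires no change to the core code */
data _null_;
   set work.targets;
   call execute(cats('proc means data=', libname, '.', memname,
                     ' n mean min max; run;'));
run;
```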
Imputing Dose Levels for Adverse Events
John R Gerlach and Igor Kolodezh
Besides the standard reporting of adverse events in clinical trials, there is a growing interest in producing similar analyses in the context of exposure to the treatment drug at the onset of an adverse event. Given an ADaM data set containing adverse events (ADAE), the intended analysis initially requires the inclusion of a variable denoting the dose level at the onset of an adverse event, called DOSEAEON. This variable would contain a null value for non-treatment-emergent events, zero for placebo, and the actual dose level otherwise. Moreover, DOSEAEON would be used to create grouping variables for the actual analysis. This paper discusses the challenges of implementing a hierarchical methodology to determine the dose level at the onset of an adverse event.
Identifying patient characteristics towards reducing hospital readmissions: Propensity Score Matching using JMP Pro
Pradeep Podila, George Relyea and Daniel Clark
Hospitalizations, overall, account for approximately 31 percent of total health care expenditures. Moreover, statistics show that 18 percent of Medicare patients discharged from hospitals are readmitted within 30 days of discharge, costing about $15 billion. These skyrocketing costs are a hefty burden on the hospitals and the government.
Yet hospital readmissions do not necessarily indicate low quality of care during the hospitalization period, because a number of risk factors contribute toward these readmissions. Some of these factors include premature discharge; complications, such as a drug interaction; socio-economic factors; poor transitions between different providers and care settings; discharges to inappropriate settings; failure to receive adequate information or resources to ensure continued progression; and gaps in coordination of care, communication, and information exchange between inpatient and community-based providers.
According to Section 3025 of the Affordable Care Act (2010), Congress directed the Centers for Medicare and Medicaid Services (CMS) to reduce its payments to hospitals with high readmission rates and to penalize those hospitals with worse-than-expected 30-day readmission rates. So Methodist Le Bonheur Healthcare (MLH) plans to proactively use SAS® JMP Pro 10 to develop a Propensity Score Matching (PSM) model that identifies patient characteristics after readmitted patients are matched with those who were not readmitted on non-modifiable risk factors such as age, race, and gender, in order to determine whether readmissions are chronically attributable to characteristics such as diagnosis codes and medications. The analysis will be conducted on 351,908 (All Payor) and 134,272 (Medicare) discharges from January 2007 through December 2012. The data set contains 60 variables, such as length of stay (LOS), number of medications, home ZIP code, number of diagnosis codes, primary and secondary diagnosis codes, procedure codes, family support, home phone number, education level, employment status, and marital status. This model will help MLH design care strategies directed toward potential readmits, thereby reducing unplanned readmissions.
Patient Profile Graphs Using SAS®
Patient profiles provide information on a specific subject participating in a study. The report includes relevant data for a subject that can help correlate adverse events to concomitant medications and other significant events as a narrative or a visual report.
This presentation covers the creation of the graphs useful for visual reports from CDISC data. It includes a graph of the adverse events by time and severity, graphs of concomitant medications, vital signs and labs. All the graphs are plotted on a uniform timeline, so the adverse events can be correlated correctly with the concomitant medications, vital signs and labs. These graphs can be easily incorporated with the rest of the demographic and personal data of the individual patient in a report.
Using SAS to read, modify, copy, and create comments on a Case Report Form in .pdf format
Annotating Case Report Forms for a clinical study is a labor-intensive, manual process that is subject to human error and difficult to validate. Significant benefits can be achieved by reading, validating, modifying, copying, and creating these comments programmatically. While reading the comments is relatively easy, modifying and creating comments from scratch is very difficult, due to a lack of clear documentation on the Adobe Acrobat file structure. This presentation will show examples of using SAS to create the essential components necessary to gain control over all comments in a .pdf document. The presentation will conclude with a discussion of the value and potential application of these novel SAS tools. This technique is useful not just to the pharmaceutical industry, but to anyone who creates annotated .pdf documents.
The Baker Street Irregulars Investigate: Perl Regular Expressions and CDISC
Peter Eberhardt and Wei Liu
A true detective needs the help of a small army of assistants to track down and apprehend the bad guys. Likewise, a good SAS® programmer will use a small army of functions to find and fix bad data. In this paper we will show how the small army of regular expressions in SAS can help you. The paper will first explain how regular expressions work, then show how they can be used with CDISC.
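As a small example of the kind of check discussed (the data set and variable follow CDISC naming conventions but are hypothetical here), a Perl regular expression can flag --DTC values that do not start with a complete ISO 8601 date:

```sas
/* Flag adverse event start dates that are not at least YYYY-MM-DD */
data work.bad_dates;
   set sdtm.ae;
   retain re;
   if _n_ = 1 then re = prxparse('/^\d{4}-\d{2}-\d{2}/');
   if not prxmatch(re, strip(aestdtc)) then output;
run;
```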
What Do Your Consumer Habits Say About Your Health? Using Third-Party Data to Predict Individual Health Risk and Costs
The Affordable Care Act is bringing dramatic changes to the health care industry. Previously uninsured individuals are buying health insurance and consuming health care services differently. These changes are forcing insurers to reevaluate marketing, engagement and product design strategies.
The first step in addressing these challenges is understanding the financial risk of new entrants into the marketplace. How do you predict the risk of a person without any historical cost information? What if all you know is the name and address?
The finance industry has long been using third-party consumer data to predict future finance habits and credit risk. This paper takes a look at applying advanced analytics from SAS to third-party data for predicting health care utilization risk and costs.
Kaplan-Meier Analysis: A Practical Guide For Programmers
An important branch of statistics is survival analysis, which involves the modeling of time-to-event data. Within the context of clinical trials, this can represent the time between when a patient enrolls in a study and when a medically significant event occurs. Such analysis allows investigators to deduce, for example, the probability that an individual will survive past a certain time. A common problem in the analysis of clinical trials is how to appropriately handle censored data. The Kaplan-Meier (K-M) estimator of the survival function provides an elegant and robust method of survival analysis while properly handling censored data. Although it is common practice for SAS programmers in the research community and pharmaceutical industry to use PROC LIFETEST to generate outputs, a comprehensive understanding of K-M survival analysis is required to appropriately interpret the results. The objective of this presentation is not only to demonstrate the correct use of PROC LIFETEST when studying survival data but also to describe the statistical fundamentals, the underlying calculations, and the appropriate analytical tools, so that readers are well equipped to incorporate K-M analysis into their own research.
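A typical call on an ADaM time-to-event data set looks like the sketch below (the names follow ADaM conventions but are hypothetical here; CNSR=1 marks censored records):

```sas
/* K-M estimates by treatment arm, with an at-risk table and
   Hall-Wellner confidence bands on the survival plot */
proc lifetest data=adam.adtte plots=survival(atrisk cb=hw);
   time aval*cnsr(1);
   strata trtp;
run;
```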
Using the 7th Edition American Joint Committee on Cancer (AJCC) Cancer Staging Manual to Determine Esophageal Cancer Staging in SEER-Medicare Data
Johnita Byrd and Felix Fernandez
The purpose of this paper is to discuss the methodology used to create a consistent esophageal cancer staging system, using the 7th edition AJCC cancer staging, for patients whose data span a 10-year period with different staging criteria.
In a study using SEER-Medicare data, we examined health care utilization in surgically treated beneficiaries with esophageal cancer. Cancer staging was required in determining primary outcomes for this study. Across the time span of our study population, two different AJCC staging systems were needed. In each system, there were slight differences to the way the staging was developed. As it was important to use a uniform cancer staging system for consistency, we chose to use the 7th edition (most recent) to create a cancer staging for the entire population so that we could make assumptions about the population in terms of the present.
In order to use the 7th edition AJCC Cancer staging system for all patients, we created algorithms in SAS. These algorithms included a number of variables across several years. First, we determined different locations of a tumor based on a primary site variable. The T, N, and M staging variables were categorized by AJCC differently at different time periods. We researched each variable to standardize the different T, N, and M stages. We then classified the list of histology types into three groups: Adenocarcinoma, Squamous Cell, or Other/Unknown stage groupings. The grade variable remained consistent across all years. Once all variables were uniform, algorithms were created to calculate the most recent AJCC Cancer staging, based on literature explaining the development of staging for esophageal cancer in the 7th Edition of the AJCC Cancer Staging Manual.
Methods used to derive the 7th edition AJCC Cancer stages can be expanded and utilized in developing updated staging systems for other cancers. Such methods allow the use of the most recent staging system regardless of the dates of the SEER-Medicare data, especially since SEER has not updated its data to the most recent version.
Survey of Population Risk Management Applications Using SAS(r)
The business of health insurance has always been to manage medical costs so that they don't exceed premium revenue. The PPACA legislation, which is now in full force, amplifies this basic business driver by imposing MLR thresholds and establishing other risk-bearing entities like ACOs and BPCI conveners. Monitoring and knowing about these patient populations will mean the difference between success and financial ruin. SAS(r) software provides several mechanisms for monitoring and managing risk, including OLAP cubes, Enterprise Miner, and Visual Analytics. This paper surveys these SAS(r) solutions in the context of the population risk management problems now part of the healthcare landscape.
Using SAS to Examine Social Networking Difference between Faculty and Students
Abbas Tavakoli, Joan Culley, Hein Laura, Blake Frazier and Williams Amber
Social networking is very important for nursing students and faculty. There are many social networking websites, such as Twitter, Yammer, LinkedIn, and Facebook. Facebook is by far the most popular and perhaps the most well-known social networking website. The purpose of this presentation is to use SAS to examine whether there was a difference in social networking beliefs and practices between faculty and students. A survey was developed and sent electronically to all nursing students and faculty members at a university-based college of nursing located in the southeastern United States. PROC FREQ and PROC TTEST were used to examine the differences by group. The data showed some differences in social networking beliefs and practices between students and faculty.
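A sketch of the group comparisons described (the data set and variable names are hypothetical): a chi-square test for a categorical practice item and a t-test for a continuous belief score:

```sas
/* Chi-square test for a categorical practice item by group */
proc freq data=work.survey;
   tables role*uses_facebook / chisq;   /* role: faculty vs. student */
run;

/* t-test comparing a continuous belief score between the groups */
proc ttest data=work.survey;
   class role;
   var belief_score;
run;
```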
Getting Out of the PROC PRINT Comfort Zone to Start Using PROC REPORT
Imelda C. Go and Abbas S. Tavakoli
PROC PRINT is one of the first things taught to a beginner SAS programmer because it provides an easy and simple way to view the records in a data set. The procedure is fast, simple, and straightforward. As one continues to learn about SAS, one finds out about other procedures such as PROC REPORT. This paper is written for the PROC PRINT user who has not, for whatever reason, ventured into PROC REPORT territory. The paper provides examples of PROC PRINT code and the corresponding PROC REPORT code that produces the same results. Examples of what PROC REPORT can produce that PROC PRINT cannot are also provided.
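As a minimal illustration of the kind of side-by-side comparison the paper develops, the same simple listing can be produced both ways using a data set shipped with SAS:

```sas
/* A simple listing with PROC PRINT ... */
proc print data=sashelp.class noobs;
   var name sex age;
run;

/* ... and its PROC REPORT equivalent */
proc report data=sashelp.class nowd;
   column name sex age;
run;
```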
SAS Macros to Conduct Common Biostatistical Analyses and Generate Reports
Dana Nickleach, Yuan Liu, Adam Shrewsberry, Robert Steven Gerhard, Kenneth Ogan, Sungjin Kim and Zhibo Wang
Minimize your time spent on common biostatistical analyses by maximizing your use of these macros. Put them to work calculating statistics and producing high quality report tables summarizing your results in Word documents.
These macros are useful for conducting a complete analysis, from start to finish. Use them to 1) produce descriptive statistics, including frequencies and percentages for categorical variables; and n, mean, median, standard deviation, min, and max for quantitative variables; 2) produce parametric and non-parametric bivariate statistics with either quantitative or categorical
variables, including Chi-Square test, Fisher’s exact test, ANOVA, Kruskal-Wallis test, Pearson correlation coefficient, and Spearman rank correlation coefficient dependent on variable types; 3) look at the unadjusted associations of each variable with a binary or survival outcome, reporting odds ratios or hazard ratios, respectively; 4) conduct multiple regression using logistic regression or Cox proportional hazards models incorporating all variables or using a backward variable selection method. The capabilities of these macros and how to use them will be illustrated using data from a kidney stone questionnaire designed to examine the factors that influence patient preference for ureteroscopy vs. shock wave lithotripsy. The macros used for this study enabled production of comprehensive, professional looking reports, efficient communication and collaboration with investigators, and ensured that timely and high-quality service was delivered.
Let the Code Report the Running Time
A programmer/analyst may often find it necessary to know the exact start, stop, and elapsed time of a SAS program or of a specific part of the code, especially when it is executed through a scheduled job. SAS's STIMER option sometimes gives confusing information, and manual calculations are still needed. This paper presents a tip on how to have the exact running and elapsed times reported in the log. A macro utility is provided as an example of how this process can be streamlined and made flexible so that clear and customizable information can be reported.
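One way such a utility might look (the macro names here are hypothetical): a pair of macros that bracket a step and write its start, stop, and elapsed times to the log:

```sas
%macro tic;
   /* record the start time */
   %global _t0;
   %let _t0 = %sysfunc(datetime());
%mend tic;

%macro toc(label=step);
   /* report start, stop, and elapsed time for the bracketed step */
   %local _t1;
   %let _t1 = %sysfunc(datetime());
   %put NOTE: &label started at %sysfunc(putn(&_t0, datetime20.));
   %put NOTE: &label ended at %sysfunc(putn(&_t1, datetime20.));
   %put NOTE: &label elapsed %sysfunc(putn(%sysevalf(&_t1 - &_t0), 8.2)) seconds.;
%mend toc;

%tic
proc sort data=sashelp.cars out=work.cars; by make; run;
%toc(label=Sort step)
```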
Not just another macro
Y. Christina Song
In most cases, the SAS macro facility handles repetitive iterations. Yet even with macros, programmers still have to read the specification carefully, modify the data, and make the macro calls accordingly. To improve overall cost-efficiency substantially, a procedure is introduced here to let the specification do the SAS programming whenever possible: SAS dynamically reads the specification, tweaks the data, generates all macro calls, codes the text in between macros, and outputs the data into the desired reports, all according to the specification. The method takes advantage of the data-driven and dynamic features of SAS macros. The paper illustrates the key steps and outlines the sample code used in the procedure.
SAS Web Editor, is it the right choice for you?
Rebecca Ottesen and Jamelle Simmons
The SAS® Web Editor was released in SAS OnDemand for Academics as a way for students to learn to use SAS via a Web-based tool. The Web Editor is similar to traditional Windows-based SAS, with the added advantages that there is no installation and it is platform independent. Being able to access SAS in this way is extremely important for classroom use, where the type of machine varies across students. However, differences exist in how data, libraries, and files are accessed, and the Web Editor also requires an Internet connection. With the release of the Web Editor to the global SAS community in SAS 9.4, we will share our experience and the tradeoffs between using SAS in a local installation versus a Web-based environment.
Does the Percentage of College Student and Military Personnel Group Quarters Affect Political Contributions per Zip Code? Visualization with PROC GMAP
A geographical analysis was performed using SAS v 9.3 to investigate the relationship between political contributions from a zip code and the percentage of college and military individuals within that zip code. For this analysis, data was merged from publicly available 2012 Federal Election Commission (FEC) contribution data and 2010 Census data. Additional variables were created using this combined data set. Other variables used in this model to help explain variance were the percentage of white, non-Hispanic or Latino individuals within a zip code, the percentage of contributions to the Democratic party from each zip code, and the U.S. Census District that the zip code falls in. Proc GMAP was used to help visualize the contributions per state as well as the percentage of college and military individuals to determine if there was a relationship between the two.
GLIMMIX_Rasch: A SAS® Macro for Fitting the Dichotomous Rasch Model
Yi-Hsin Chen, Isaac Li and Jeffrey Kromrey
For research areas such as education, public health, psychology, and sociology, the Rasch measurement models provide a framework of design and analysis tools that differs from classical true score theory (CTST). General statistical software packages (e.g., SPSS or Stata) have been considered either unsuitable or difficult to program for the purpose of implementing this complex approach. To obtain more accurate, valid, and reliable measurement, researchers often resort to specialized Rasch computer programs at additional cost and must spend time and effort learning how to operate them. The GLIMMIX procedure in the latest version of SAS (SAS 9.3) can be employed to fit the Rasch measurement models easily. This article documents a SAS macro that fits the dichotomous Rasch model, estimates item and person parameters, and calculates unstandardized and standardized fit indices associated with this model. A simulation study was conducted to examine estimation accuracy and the extent of bias in the parameters and estimates yielded by this macro.
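The GLIMMIX specification underlying this kind of macro can be sketched as follows; the data set and variable names (long, person, item, response) are hypothetical, and the data are assumed to be in long format with one row per person-item response:

```sas
/* Sketch of fitting a dichotomous Rasch model with PROC GLIMMIX:
   items as fixed effects, persons as random intercepts, adaptive
   quadrature estimation. Names are illustrative, not the paper's macro. */
proc glimmix data=long method=quad;
   class person item;
   model response(event='1') = item / noint dist=binary link=logit solution;
   random intercept / subject=person;
run;
```

The fixed-effect solutions give item difficulty estimates (with sign conventions to check), and the random-intercept predictions give person ability estimates.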
POSTEQUATE: A SAS® Macro for Conducting Non-IRT Test Post-equating
Isaac Li and Jeffrey Kromrey
Testing organizations publish multiple editions of an exam (test forms) for practical purposes. Equating studies are conducted to ensure score interchangeability between different test forms. This paper presents a SAS macro that conducts post-equating using five different methods. The macro reads in response data from the two test forms to be equated and outputs a conversion table between raw scores on the new form and five sets of equivalent scores on the old form, calculated using these methods. The paper provides a demonstration of the SAS code, an example of its application with sample data, and related output.
Analytical Approach for Bot Cheating Detection in a Massive Multiplayer Online Racing Game
Andrea Villanes Arellano
The videogame industry is a growing business worldwide, with an annual growth rate that exceeded 16.7% for the period 2005 through 2008. Moreover, revenues from online games will account for more than 38% of total video game software revenues by 2013. As a result, online games are vulnerable to illicit player activity that results in cheating. Cheating in online games can damage the reputation of a game when honest players realize that their peers are cheating, resulting in a loss of trust from honest players and ultimately reducing revenue for the game producers. Analysis of game data is fundamental for understanding player behaviors and combating cheating in online games. In this work, we propose a data analysis methodology to detect cheating in massively multiplayer online (MMO) racing games. More specifically, our work focuses on bot detection. A bot controls a player automatically and is characterized by repetitive behavior. Players in an MMO racing game can use bots to play during races using artificial intelligence that favors their odds of winning, and to automate the process of starting a new race upon finishing the last one. This results in a high number of races played, with race duration showing low mean and low standard deviation, and time in between races showing a consistently low median value. A case study is built upon data from an MMO racing game, and our results indicate that our methodology successfully characterizes suspicious player behavior.
Role of Fibrinogen, HDL Cholesterol and Cardio Respiratory Fitness in Predicting Mortality Due to Cardio-vascular Disease: Results From the Aerobics Center Longitudinal Study
Srinivasa Madhavan, Steven Blair and Abbas Tavakoli
Aim: The major aim of the study was to evaluate the associations of high density lipoprotein (HDL) cholesterol and plasma fibrinogen in determining the risk of mortality due to CVD. The secondary aim of the study was to examine the effect of cardiorespiratory fitness (CRF) on this relationship.
Methods: This is a cohort of predominantly Caucasian men and women of higher socioeconomic status (N=25,673) who visited the Cooper Clinic in or after 1990 and are a part of the Aerobics Center Longitudinal Study (ACLS). The main predictor variables were collected at baseline. The death registry was used for the follow-up of the outcome-mortality due to CVD. Survival analysis technique was used to assess the associations between the predictors and the outcome. All data analyses for research aims were performed using SAS statistical software, version 9.2.
Results: HDL was associated with CVD death (HR 0.985 [0.97-1.00]), but the association was only marginally significant (p=0.054). The association between fibrinogen and CVD death was significant: HR 1.004 (1.001-1.007). With the addition of CRF, HDL had a protective effect (HR 0.989) on CVD death, which was not statistically significant (CI 0.974-1.004, p=0.155), and fibrinogen had an association with CVD death (HR 1.003), which also was not statistically significant (CI 0.999-1.004, p=0.101). CRF had a significant protective relationship with CVD death: HR 0.924 (CI 0.89-0.971).
Conclusion: This study concludes that fibrinogen plays a role in predicting death due to CVD in this adult Caucasian population of higher SES, while HDL has only a weak association with mortality due to CVD. Cardiorespiratory fitness provides a protective effect against death due to CVD. CRF may play a role in mediating the fibrinogen-CVD death relationship.
Comparing PROC MI and IVEWare callable software
Bruno Vizcarra and Amang Sukasih
Multiple imputation has become a common technique for dealing with missing data. Multiple imputation accounts for the variability involved in imputing data values. Advances in technology have allowed the development of a number of software packages and modules that can perform multiple imputation, such as in SAS, R, Stata, and IVEware. IVEware (Raghunathan 2001) is imputation and variance estimation software that can be run in SAS (SAS-callable). In this paper we focus on only two packages: SAS PROC MI and IVEware. SAS PROC MI provides imputation options including regression imputation and MCMC techniques, while IVEware implements a sequential regression imputation technique. Though the basic modeling and prediction in SAS PROC MI and IVEware are comparable, the imputations are developed under different procedures. This paper compares the two methods using a data set with count variables having a Poisson distribution. We compare the first-attempt imputation and discuss the limitations and issues encountered. We then present a modified approach in which an indicator variable is first created and imputed, then used to determine whether or not to impute the counts.
Keywords: missing values, item nonresponse, missing at random, multiple imputation, sequential regression, MCMC
Winning the War on Terror with Waffles: Maximizing GINSIDE Efficiency for Blue Force Tracking Big Data
Troy Martin Hughes
The GINSIDE procedure represents the SAS solution for point-in-polygon determination. The procedure requires three parameters—a map data set representing the polygon, a test data set representing the points, and a list of ID fields that are attributed to an observation when point-in-polygon is determined. In the first part of this paper, by varying the type and quantity of data in the test and map data sets as well as the number of ID fields, a regression model of runtime is developed. The most significant factors prescribing longer GINSIDE runtimes are the number of observations and the number (and variety) of fields in the test data set. In the second part of the paper, methods are demonstrated that effectively reduce GINSIDE runtime by over 500 percent on sample data sets by developing a waffle schema—or, simply, a waffle—to overlay each polygon. Composed of thousands of rectangles of increasingly smaller size, waffles are to geospatial polygons as cookies are to websites. The preformed waffles record whether a rectangle is wholly inside or wholly outside the polygon. Points falling in these “known” waffles do not require the GINSIDE procedure to establish polygon inclusion. In the third part of this paper, the test data are thrown to the wind and the waffle iron is applied to the Department of Defense (DoD) Blue Force Tracking (BFT) database for Afghanistan, a classic big data multi-billion record database. Runtime efficiency gains of over 500 percent again demonstrate the benefit of this improved, data-agnostic GINSIDE hybrid methodology. As a side note, Appendix A details computational errors in the SAS GINSIDE algorithm that were discovered by the author in pursuit of this paper.
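For reference, a basic GINSIDE call with the three parameters described above looks like the following; the data set and variable names are illustrative:

```sas
/* Point-in-polygon lookup with PROC GINSIDE. The map data set holds the
   polygon vertices (X/Y plus the ID variable); the points data set holds
   the X/Y coordinates to test. Observations that fall inside a polygon
   receive that polygon's ID value(s) in the output data set. */
proc ginside data=points map=polygons out=located;
   id region_id;
run;
```

Every additional ID variable listed is carried onto each located observation, which is one reason the number of ID fields influences runtime.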
Stock Prices Analysis
This paper researches traditional stock market prices for the last ten years. The goal of this paper is to analyze financial data for three major players in the video gaming industry: Electronic Arts, Activision Blizzard, and Nintendo. The data involve three parameters that are closely tracked to reflect low and high percentage marks. Many analysts monitor this sort of data to identify key performance indicators for their budget models. SAS Enterprise Guide conveniently provides the capability to import Microsoft Excel workbooks and overcome the hassle of format conversion for multiple data types, especially dates and euro-to-dollar currency conversion.
One very useful feature is the ability to roll up lengthy data into meaningful ranges for visualizing implicit patterns embedded within large datasets. For instance, thousands of observations that represent daily information can be organized on a quarterly basis, helping to determine whether seasonal trends recur over multiple years. Console games are generally shipped to market around the holiday season for the masses to purchase. The hypothesis is to coordinate product launches with affiliated real-world events, like coinciding the launch of Madden NFL with the Super Bowl. In this particular scenario, analysts are interested in niche audiences such as immediate friends and family of real athletes. Their willingness to purchase the latest and greatest at premium prices is the driving criterion for this scenario analysis.
Reporting and Information Visualization
Analytical modeling and content analysis mapping with SAS
Analytical modeling, when coupled with statistical categorization based on measurable textual concepts, fares well with SAS software. Data analytics is a necessary element in adopting virtual systems, identifying Monte Carlo scenarios that may be validated in simulated environments. This presentation demonstrates a statistical test in predictive analytics for the adoption of virtualized infrastructure as a service with a platform in a cloud environment.
Not Enough Time To Catch Extreme Observations? Flag and Report with Macros and Arrays
Investigator-to-investigator variability in lab measurements can be an issue when all of the results are combined for analysis and publication, or when a patient’s diagnosis depends on the accuracy of human measurement. This paper examines methods for comparing multiple researchers’ measurements against a “gold standard”. It also reports the individuals, and their measurements, that fall outside of set ranges of acceptability. As an example, histology data were collected over three academic quarters in which students were required to take multiple bone section measurements and calculate densities; their results were compared with the professor’s standards as an exercise to demonstrate investigator variability. This project combines SAS macros, arrays, and reporting methods to identify the individuals whose measurements fall outside of a set percentage of acceptability from the professor’s standards. The methods outlined in this paper serve as a way to check the accuracy of investigators and help management determine whether intervention is needed when it is not possible to double-check every investigator’s work.
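The flagging idea can be sketched with an array comparison against the gold standard; all data set and variable names, and the ±10% threshold, are hypothetical:

```sas
/* Flag measurements that deviate from the gold standard by more than 10%.
   One row per investigator; m1-m3 are measurements, g1-g3 the standards. */
data flagged;
   set measurements;
   array meas{3} m1-m3;    /* investigator measurements */
   array gold{3} g1-g3;    /* gold-standard values      */
   array flag{3} f1-f3;    /* out-of-range indicators   */
   do i = 1 to dim(meas);
      flag{i} = (abs(meas{i} - gold{i}) > 0.10 * gold{i});
   end;
   drop i;
run;

/* Report only investigators with at least one flagged measurement */
proc print data=flagged;
   where f1 or f2 or f3;
run;
```

In practice the threshold and the lists of measurement variables would be macro parameters, which is where the paper's macro approach comes in.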
Automating Visual Data Mining Using Bihistograms, the SAS Annotate Facility and SAS Macros
The first step in building a predictive model is to understand the available data. There are two main components to this process. The first, usually called vetting, consists of looking at the minimums, means, medians, maximums, standard deviations, and percentages missing for all numeric variables. Variables with excessive missing values or constant values can be eliminated from consideration as predictors. This first phase of data understanding shows what data can be used.
The second, and more difficult, phase has to do with identifying which variables are discriminatory with respect to the outcome being predicted. Techniques to accomplish this range from the simple, such as considering the difference between the means of a variable when partitioned by a binary dependent variable, to the more sophisticated, such as computing a Kolmogorov-Smirnov statistic on the two populations defined by a dichotomous outcome.
A not-well-enough-known graphical approach to this second task uses what is known as a bihistogram. This visualization plots side-by-side histograms of the variable under scrutiny, with one side associated with the positive dependent outcome and the other side with the negative outcome. This approach graphically shows all the differences between two distributions that the Kolmogorov-Smirnov statistic tries to capture in one number. The bihistogram tells you, in one glance, whether differences in means, variances, or other factors cause the two distributions to be significantly different.
In this paper we describe how to use the SAS Annotate data set to produce bihistograms that illustrate differences in distributions defined by a dichotomous outcome. Missing values and the relative frequency of the dependent variable are also included in the graphic. Macro code to automate the generation of bihistograms for all candidate numeric predictors in a data set is also provided.
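A bar-by-bar Annotate sketch of a bihistogram might look like the following; this illustrates the general Annotate technique rather than the authors' macro, and the input data set binned and its variables are hypothetical:

```sas
/* Build an Annotate data set drawing, for each bin, an upward bar for the
   positive outcome and a downward bar for the negative outcome, then
   render it. Input binned has: mid (bin midpoint), halfw (half bin width),
   pct_pos and pct_neg (percent of each population in the bin). */
data anno;
   set binned;
   length function color style $8;
   xsys = '2'; ysys = '2';                      /* data coordinates */
   function = 'move'; x = mid - halfw; y = 0;            output;
   function = 'bar';  x = mid + halfw; y = pct_pos;
   color = 'blue'; style = 'solid';                      output;
   function = 'move'; x = mid - halfw; y = 0;            output;
   function = 'bar';  x = mid + halfw; y = -pct_neg;
   color = 'red';  style = 'solid';                      output;
run;

/* DATASYS tells GANNO to scale the data-coordinate systems */
proc ganno annotate=anno datasys;
run;
```

A macro wrapper would loop this over every candidate numeric predictor, which is the automation the paper describes.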
Seven Steps to a SAS EBI Proof-of-Concept Project
The Purchasing Department is considering contracting with your team for a new SAS EBI application. The department's manager has already met with SAS and seen the sales pitch, and he is very interested. But the manager is a tightwad and not sure about spending the money. He also wants his team to be the primary developers for this new application. Before investing his money in training, programming, and support, he would like a proof-of-concept. This paper will walk you through the seven steps to create a SAS EBI POC project:
Remember, your goal is not to launch a full-blown application. Instead, we’ll strive towards helping them see the potential in your organization for applying this methodology.
- Hold a kick-off meeting, including a full demo of the SAS EBI tools.
- Set up your Unix filesystems and security.
- Set up your SAS metadata ACTs, users, groups, folders, and libraries.
- Make sure the necessary SAS client tools are installed on the developers’ machines.
- Hold a SAS EBI workshop to introduce them to the basics, including SAS Enterprise Guide, SAS Stored Processes, SAS Information Maps, SAS Web Report Studio, the SAS Information Delivery Portal, and the SAS Add-In for Microsoft Office, along with supporting documentation.
- Work with them to develop a simple project, one that highlights the benefits of SAS EBI and shows several methods for achieving the desired results.
- Last but not least, follow-up!
Seamless Dynamic Web (and Smart Device!) Reporting with SAS®
The SAS® Business Intelligence platform provides a wide variety of reporting interfaces and capabilities through a suite of bundled components. SAS® Enterprise Guide®, SAS® Web Report Studio, SAS® Add-In for Microsoft Office, and SAS® Information Delivery Portal all provide a means to help organizations create and deliver sophisticated analysis to their information consumers. However, businesses often struggle with the ability to easily and efficiently create and deploy these reports to the web and smart devices. When it is done, it is usually at the expense of giving up dynamic ad-hoc reporting capabilities in return for static output or, at best, limited parameter-driven customization.
The obstacles facing organizations that prevent them from delivering robust ad-hoc reporting capabilities on the web are numerous. More often than not, it is due to the lack of IT resources and/or project budget. Other failures may be attributed to breakdowns during the reporting requirements development process. If the business unit(s) and the developers cannot come to a consensus on report layout, critical calculations, or even what specific data points should make up the report, projects will often come to a grinding halt.
This paper will discuss a solution that enables organizations to quickly and efficiently produce SAS reports on the web and mobile devices - in less than 10 minutes! It will also show that by providing self-service functionality to the end users, most of the reporting requirements development process can be eliminated, thus accelerating production-ready reports and reducing the overall maintenance costs of the application. Finally, this paper will also explore how the other tools on the SAS Business Intelligence platform can be leveraged within an organization.
Instant Disaggregation: Using the macro language to provide reports with parallel structure across different subsets of the data set.
Effective disaggregation of data is a primary tool in root cause analysis. SAS provides methods for rotating through a variety of filters that produce reports with the same structure applied to different subsets of the data. The parallel structure of the reports allows report consumers to quickly contrast the different subsets. When the macro language is combined with by group processing you can create easily navigated PDFs that allow report consumers to have reports customized by location and disaggregated by subgroups.
Coding to prepare for generating this style of report is not complicated. You create a control table whose columns contain the groupings used to separate the different reports and whose rows represent the different sub-reports to be generated. The macro language allows you to read the information in the control data set and use those stored values as filters in the report code.
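A common way to implement this pattern is to read the control table in a DATA step and generate one macro call per row with CALL EXECUTE; the control table, report data set, and %run_report macro below are hypothetical:

```sas
/* One filtered, identically-structured report per control-table row.
   Macro and data set names are illustrative. */
%macro run_report(location, subgroup);
   ods pdf file="report_&location._&subgroup..pdf";
   proc report data=results;
      where location = "&location" and subgroup = "&subgroup";
      /* ... common report layout shared by every sub-report ... */
   run;
   ods pdf close;
%mend run_report;

data _null_;
   set control;   /* one row per location/subgroup combination */
   call execute('%nrstr(%run_report(' || strip(location) || ','
                || strip(subgroup) || '))');
run;
```

Wrapping the generated call in %NRSTR defers macro execution until after the DATA step completes, which avoids timing problems when the macro itself contains DATA or PROC steps.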
Case Study: Migrating an Existing SAS Process to Run on the SAS Intelligence Platform
In 2012, the Wellpoint Medicaid Business Unit (formerly Amerigroup Corporation) installed components of the SAS Intelligence Platform and moved away from what was primarily a desktop SAS environment. We sought to take advantage of the enhanced capabilities of the SAS Intelligence Platform and also enable new technologies such as SAS Enterprise Miner. However, this meant migrating a significant SAS application, the Chronic Illness Intensity Index (CI3), to run in Enterprise Guide and the SAS Add-In for Microsoft Office. The CI3 reports present the results of the Wellpoint Medicaid Business Unit Continuous Case-Finding prioritization process and improve efficiency by providing relevant clinical information about members to drive medical management activities. This paper describes the lessons learned and points out some best practices for updating a legacy SAS application to run in a new environment.
Mobile Reporting at University of Central Florida
Mobile devices are taking over conventional ways of sharing and presenting
information in today’s business and working environments. Accessibility to
this information is a key factor for companies and institutions seeking to
reach wider audiences more efficiently these days.
SAS already provides a powerful set of tools that allow developers to meet
this increasing demand for mobile reporting without needing to upgrade to the
latest version of the platform.
Here at the University of Central Florida, using the SAS 9.2 EBI environment,
we were able to create reports targeting the iPad consumers at our executive
level in order to provide them with the relevant data they need for decision
making. Our goal was to provide them with reports that fit on one screen, in
order to avoid the need for scrolling, and that are easily exportable to PDF.
These capabilities were well received by our users.
This paper will present the techniques we used to create these ‘mobile’
reports.
Experiences in Using Academic Data for BI Dashboard Development
Evangeline Collado and Michelle Parente
Business Intelligence (BI) dashboards serve as an invaluable, high level,
visual reference tool for decision making processes in many business
industries. A request was made to our department to develop some BI dashboards
that could be incorporated in an academic setting. These dashboards would aim
to serve various undergraduate executive and administrative staff at the
university. While most business data lend themselves well to dashboard
development, academic data are typically modeled differently and therefore
present unique challenges. In the following paper,
the authors will detail and share the design and development process of
creating dashboards for decision making in an academic environment utilizing
SAS BI Dashboard 4.2 and other SAS Enterprise Business Intelligence (EBI) 9.2
tools. The authors will also provide lessons learned as well as recommendations
for future implementations of BI dashboards utilizing academic data.
Uncovering Patterns in Textual Data with SAS Visual Analytics and SAS Text Analytics
SAS Visual Analytics is a powerful tool for exploring big data to uncover
patterns and opportunities hidden within your data. The challenge with big
data is that the majority is unstructured, taking the form of customer
feedback, survey responses, social media conversations, blogs, and news
articles. By integrating SAS Visual Analytics with SAS Text Analytics,
customers can uncover patterns in big data while enriching and visualizing
their data with customer sentiment and categorical flags, and uncovering root
causes that primarily exist within unstructured data.
This paper highlights a case study that provides greater insight into big
data and demonstrates advanced visualization, while enhancing time to value
by leveraging the high-performance, in-memory technology of SAS Visual
Analytics, Hadoop, and SAS’ advanced Text Analytics capabilities.
How to Replicate Excel Stacked Area Graphs in SAS
Have you ever been given a graph and asked to reproduce the same thing using
SAS? As you stare at the graph, you realize you have no idea what to call
this type of graph, let alone how to begin programming it. Your helpful SAS
programming coworkers have no idea either, but suggest you try looking for
some kind of stacked bar chart. As you desperately search through all types
of SAS documentation and conference papers, you finally realize the solution
may reside within PROC GPLOT. This paper will walk you through the steps to
eliminate the original panic of determining the graph type and then walk you
through the steps of creating a very close replica of an Excel stacked area
graph.
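The core GPLOT trick is to cumulate the series and overlay filled areas; a sketch, with hypothetical data set and variable names, might look like this:

```sas
/* Excel-style stacked area graph with PROC GPLOT: stack the series by
   cumulating them, then fill the regions between the overlaid lines. */
data cum;
   set sales;               /* variables qtr, sales_a, sales_b, sales_c */
   cum1 = sales_a;
   cum2 = cum1 + sales_b;
   cum3 = cum2 + sales_c;
run;

symbol1 interpol=join value=none;
symbol2 interpol=join value=none;
symbol3 interpol=join value=none;

proc gplot data=cum;
   /* AREAS=3 fills the regions below the first three overlaid plots */
   plot cum1*qtr cum2*qtr cum3*qtr / overlay areas=3;
run;
quit;
```

Plotting the cumulative series in ascending order makes each filled band represent one original series, which is exactly what Excel's stacked area chart shows.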
Creating ZIP Code-Level Maps with SAS®
SAS®, SAS/GRAPH®, and ODS Graphics provide SAS programmers with the tools to
create professional and colorful maps. Provided with SAS/GRAPH are boundary
files for U.S. states and territories, as well as internal boundaries at the
county level. While many data and results can be displayed at this level,
often a higher degree of granularity is needed. The U.S. Census Bureau
provides ZIP code boundary files in ESRI shapefile format (.shp) by state for
free download, which can be imported into SAS using PROC MAPIMPORT. This
paper illustrates the use of these ZIP code tabulation area (ZCTA) files with
SAS to map data at a ZIP code level. Example maps include choropleth,
distance, and heat maps.
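A minimal sketch of the import-and-map workflow; the shapefile path and data set and variable names are illustrative:

```sas
/* Import a Census ZCTA shapefile into a SAS map data set, then draw a
   choropleth of a rate variable by ZIP code tabulation area. */
proc mapimport datafile="C:\shapefiles\zcta_state.shp"
               out=work.zcta_map;
run;

proc gmap data=mydata map=work.zcta_map;
   id zcta;                    /* ZCTA identifier in both data sets */
   choro rate / levels=5;     /* five shading levels */
run;
quit;
```

The ID variable must appear, with the same name and type, in both the map data set produced by MAPIMPORT and the response data set.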
The examples in this paper were created with version 9.2 of SAS on a Windows
64-bit server platform and use Base SAS, SAS/STAT, and SAS/GRAPH. The
techniques presented require a SAS/GRAPH license but are not
platform-specific and can be adapted by beginning through advanced SAS users.
A Map is Just a Graph Without Axes
SAS® PROC GMAP can produce a variety of maps of varying complexity. To go
beyond the basic capabilities of GMAP, it is necessary to use the Annotate
Facility in order to add additional information, such as symbols in specific
places. Furthermore, there are times when the desired map is simply a sketch
of geographically related measurements that need to be displayed in a
simulated, not-to-scale map. A map is simply a collection of coordinates that
are plotted but for which no X/Y axis system is typically shown (although
items such as road atlases or military maps and charts may have a grid and
axes to help locate specific points of reference). By remembering this, one
can sometimes create an embellished map using PROC GPLOT without having to
create an Annotate data set. Furthermore, by using GPLOT with the axes, one
can locate invalid map coordinates in user-created map files. Finally, an
example of creating a plotted outline map with dots showing environmental
variables using Annotate and PROC GPLOT is offered. Annotate is used in the
latter case because it was necessary to dynamically scale the dots that
represent the location and magnitude of the plotted values.
"Google-like" Maps in SAS
We are frequently asked if we can have maps similar to Google Maps in SAS.
Customers want the background image displayed behind their data so they can see
where streets or other features are located. They may also want to pan and zoom
the map. Unfortunately, Google has legal restrictions and limitations on the
use of their maps. Now, you can have “Google-like” maps inside of SAS.
You may have already seen this capability in products like SAS Visual
Analytics Explorer (VAE), and other products using these maps will be
available in future releases. This presentation will discuss and demonstrate
these new capabilities in VAE, SAS/GRAPH, and other products.
SAS Macros to Produce Publication-ready Tables from SAS Survey Procedures
Emma Frazier, Centers for Disease Control, Atlanta, Georgia
Shuyan Zhang and Ping Huang, ICF International, Atlanta, Georgia
To analyze complex survey data, analysts must understand the weights and
design variables required to complete the analysis. The SAS Survey
procedures are used for the analysis of this type of data, but their output
can be challenging to turn into quality tables. We developed SAS code that
uses the features of SAS ODS and PROC REPORT to generate publication-ready
documents from the SAS Survey procedures to complement data analysis for end
users. We present relatively straightforward SAS code that generates rich
text format tables for cross-tabulations with statistics such as weighted
percentages, 95% confidence intervals, coefficients of variation, and
Rao-Scott Chi-Square tests.
Some advanced options and techniques in PROC REPORT are presented to
demonstrate the flexibility of customizing the output style. The program has
a simple interface with the capability to create complex table formats. These
procedures are valuable for researchers who need to produce tables for
analysis and can be easily modified for various tables.
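The general workflow might be sketched like this; the data set, design variables, and ODS output column names are illustrative and should be checked against the actual procedure output:

```sas
/* Capture a weighted cross-tabulation from PROC SURVEYFREQ via ODS OUTPUT,
   then render it as an RTF table with PROC REPORT. All names are
   illustrative, not the macros from the paper. */
proc surveyfreq data=survey;
   strata stratum;
   cluster psu;
   weight wt;
   tables group*outcome / row cl chisq;   /* row %s, CLs, Rao-Scott test */
   ods output CrossTabs=ct;
run;

ods rtf file="table1.rtf";
proc report data=ct;
   /* column names depend on the SURVEYFREQ ODS template in use */
   column group outcome RowPercent;
   /* ... styling, labels, and confidence-interval columns ... */
run;
ods rtf close;
```

Routing the procedure output to a data set is what makes the table layout fully controllable, since PROC REPORT rather than the procedure decides the final presentation.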
Hospital Readmissions: Characteristics of readmits within 30 days and beyond 30 days
Daniel Clark, Pradeep Podila, Edward Rafalski and George Relyea
Hospitalizations, overall, account for approximately 31 percent of total
health care expenditures. Moreover, statistics show that 18 percent of
Medicare patients discharged from hospitals are readmitted within 30 days of
discharge, costing about $15 billion. Readmission within 30 days has been
designated by the Centers for Medicare and Medicaid Services (CMS) as a
“quality of care measure” in order to check skyrocketing costs, which place
a heavy burden on hospitals and the government.
In Section 3025 of the Affordable Care Act (2010), Congress directed the
Centers for Medicare and Medicaid Services (CMS) to reduce its payments to
hospitals with high readmission rates and to penalize those hospitals with
worse-than-expected 30-day hospital readmission rates. This is also known as
the CMS Hospital Readmission Reduction Program. But according to the Harvard
physician Dr. Ashish Jha, evidence has come to the forefront over the last
three years supporting the view that, by most standards, the readmission
metric fails as a “quality of care measure”.
So, the primary aim of the project is to explore whether patients who were
not readmitted within 30 days are better off in terms of health than those
who were readmitted within 30 days. The financial burden on patients
(pharmaceutical costs) in both groups, along with the co-morbid conditions
leading to readmissions, will be assessed.
The SAS Information Delivery System (IDS) and recurrent event analysis will
be utilized to analyze 351,908 (all-payor) and 134,272 (Medicare) discharges
from January 2007 through December 2012. The data set contains 60 different
variables, such as Length of Stay (LOS), Number of Medications, Home Area
Zip Code, Number of Diagnosis Codes, Primary and Secondary Diagnosis Codes,
Procedure Codes, Family Support, Home Phone Number, Education Level,
Employment Status, and Marital Status. This model will help MLH in designing
transition care strategies based on health conditions, ultimately working
toward reducing the influx of potential readmits and unplanned readmissions.
Statistics and Data Analysis
PROC SURVEYSELECT as a Tool for Drawing Random Samples
This paper illustrates many of the sampling algorithms built into PROC
SURVEYSELECT, particularly those pertinent to complex surveys, such as
systematic, probability proportional to size (PPS), stratified, and cluster
sampling. The primary objectives of the paper are to provide background on why
these techniques are used in practice and to demonstrate their application via
syntax examples. Hence, this is not a how-to paper on designing a
statistically efficient sample—there are entire textbooks devoted to that
subject. One exception is that the paper will discuss a few recently
incorporated sample allocation strategies—specifically, proportional, Neyman,
and optimal allocation. The paper concludes with a few examples demonstrating
how one can use PROC SURVEYSELECT to handle certain frequently encountered
sample design issues such as alternative sampling methods across strata and
multi-stage cluster sampling.
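The Neyman allocation mentioned above follows a simple rule: the sample is split across strata in proportion to each stratum's size times its standard deviation, so variable strata get more units. A minimal Python sketch of that arithmetic (an illustration of the formula, not the PROC SURVEYSELECT implementation; the function name and example figures are invented):

```python
def neyman_allocation(n_total, stratum_sizes, stratum_stddevs):
    """Allocate n_total sample units across strata in proportion to
    N_h * S_h (Neyman allocation), rounded to whole units."""
    weights = [N * S for N, S in zip(stratum_sizes, stratum_stddevs)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]

# A large but homogeneous stratum (S = 2.0) receives fewer units than
# proportional allocation alone would give it.
allocation = neyman_allocation(100, [5000, 3000, 2000], [2.0, 6.0, 10.0])
```

Proportional allocation is the special case in which all stratum standard deviations are treated as equal.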
A SAS Macro for Finding Optimal k-Means Clustering in One Dimension with Size Constraints
Fengjiao Hu and Robert Johnson
Wang and Song (2011) proposed a k-means clustering algorithm in one dimension
using exact dynamic programming which guarantees optimality. Their algorithm
solved the clustering problem by breaking it into smaller nested problems. The
one-dimensional measure may, for example, be a baseline measure in a
before-after study in which subjects are grouped (clustered) on baseline values
before randomization. In this paper we extend their work by placing constraints
on the cluster size; for example, each cluster must be no smaller than the
number of study arms. A SAS macro will be presented which finds the optimal clustering
given the constraint by minimizing the within cluster root mean squared error.
An option which randomly allocates subjects to study arms is also included. An
example will be given where a sample of primary care practices are to be
allocated to treatment or control. The study measures the degree to which
primary care physicians deliver smoking cessation counseling. Prior to
randomization, the practices are clustered on the baseline measure.
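The dynamic-programming idea behind this kind of algorithm is that optimal one-dimensional clusters are contiguous in sorted order, so the problem nests: the best partition of the first j points into c clusters reuses the best partition of a shorter prefix into c-1 clusters. A pure-Python sketch of that recursion with a minimum-size constraint (an illustration of the technique, not the authors' macro; it minimizes within-cluster SSE, which is equivalent to minimizing within-cluster RMSE for a fixed total count):

```python
def optimal_1d_clusters(values, k, min_size):
    """Optimal 1-D clustering by exact dynamic programming: partition the
    sorted values into k contiguous clusters, each with at least min_size
    points, minimizing total within-cluster sum of squared errors.
    Assumes len(values) >= k * min_size."""
    xs = sorted(values)
    n = len(xs)
    # Prefix sums give O(1) SSE for any contiguous segment xs[i:j].
    p1 = [0.0] * (n + 1)
    p2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        p1[i + 1] = p1[i] + x
        p2[i + 1] = p2[i] + x * x

    def sse(i, j):
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, n + 1):
        for c in range(1, k + 1):
            # Last cluster is xs[i:j]; the size constraint caps i.
            for i in range(0, j - min_size + 1):
                cand = cost[i][c - 1] + sse(i, j)
                if cand < cost[j][c]:
                    cost[j][c] = cand
                    back[j][c] = i
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = back[j][c]
        clusters.append(xs[i:j])
        j = i
    return clusters[::-1]
```

Because every feasible partition of the prefix is scored exactly, the result is globally optimal under the constraint, unlike iterative k-means.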
SAS® Macros CORR_P and TANGO: Interval Estimation for the Difference between Correlated Proportions in Dependent Samples
Pei-Chen Wu, Patricia Rodriguez de Gil, Thanh Pham, Diep Nguyen, Jeanine Romano, Jeffrey D. Kromrey and Eun Sook Kim
The two proportions from the same sample of observations or from matched-pair
samples are correlated. A number of studies have proposed interval estimates for
the difference in correlated proportions (e.g., Bonett & Price, 2011; Newcombe,
1998; Tango, 1998). Considering that confidence intervals (CI) are more
informative than point estimates but the CI for the difference in correlated
proportions is not readily available in SAS, the purpose of this paper is to
provide a SAS macro for three types of confidence intervals suggested in the
literature: Wald CI, adjusted Wald CI, and approximate CI proposed by Tango.
The results from a simulation study comparing these three confidence intervals
are also presented.
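For matched pairs, the simple Wald interval (one of the three intervals the macro covers) is built from the discordant counts of a 2x2 paired table: the difference estimate is (b - c)/n and its standard error is sqrt(b + c - (b - c)^2/n)/n. A minimal sketch of that computation (the function name and example counts are invented for illustration):

```python
import math

def wald_ci_paired(b, c, n, z=1.96):
    """Wald confidence interval for the difference between two
    correlated proportions, given discordant counts b and c from a
    2x2 table of n matched pairs."""
    d = (b - c) / n                                   # point estimate
    se = math.sqrt((b + c) - (b - c) ** 2 / n) / n    # Wald standard error
    return d - z * se, d + z * se

# Hypothetical example: 15 pairs discordant one way, 5 the other, n = 100.
lo, hi = wald_ci_paired(15, 5, 100)
```

The adjusted Wald and Tango intervals modify this basic construction to improve coverage for small samples and proportions near the boundary.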
Using Predetermined Factor Structures to Simulate a Variety of Data Conditions
Kevin Coughlin, Jeffrey Kromrey and Susan Hibbard
This paper presents a method through which data sets of varying characteristics
can be simulated based on predetermined, uncorrelated factor structures. As
demonstrated in a series of studies, this Monte Carlo method yields factor
structures that are clear and simple. The process begins with the application
of conceptual and actual factor loadings to the creation of correlation
matrices; samples are then generated based on these correlation matrices. The
method for generating correlation matrices allows the researcher to manipulate
the number of observed variables, the communality among variables, and the
number of common factors. The process for simulating samples of observations
provides additional options for specifying sample size and level of
measurement. This paper includes an example of the process for generating a
correlation matrix and a distribution of simulated observations. This paper is
intended for researchers who are interested in factor analytic designs and are
familiar with PROC IML.
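The sample-generation step described above typically factors the target correlation matrix (Cholesky decomposition) and multiplies independent standard-normal draws by the lower-triangular factor, so the simulated variables inherit the desired correlations. A pure-Python sketch of that step (an illustration only, not the authors' PROC IML code; the seed and matrix are arbitrary):

```python
import math
import random

def cholesky(corr):
    """Lower-triangular L with L L' = corr (corr must be positive definite)."""
    n = len(corr)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(corr[i][i] - s)
            else:
                L[i][j] = (corr[i][j] - s) / L[j][j]
    return L

def simulate_correlated(corr, n_obs, seed=2013):
    """Draw n_obs rows of standard-normal variables whose population
    correlation matrix is corr, via x = L z for iid z."""
    rng = random.Random(seed)
    L = cholesky(corr)
    p = len(corr)
    return [[sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
            for z in ([rng.gauss(0, 1) for _ in range(p)] for _ in range(n_obs))]
```

In a factor-analytic design, the input correlation matrix would itself be constructed from the chosen loadings and uniquenesses before this step.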
Forecasting Enrollment in Higher Education using SAS Forecast Studio
Erik Bowe and Steven Merritt
This session will discuss the data sources and predictor variables that have
enabled Kennesaw State University (KSU), a comprehensive metropolitan
university, to forecast its enrollment and degrees conferred with a fair amount
of accuracy. The presentation will include both internal and external sources
of data KSU utilized as well as which predictor variables had statistical
significance for tracking future growth. In addition, the factors affecting
KSU’s decision to use an ARIMA forecasting model will be outlined, along with
a review and critique of alternate ratio-based enrollment-forecasting methods.
Analyzing Multiway Models with ANOM Slicing
Multiway (multifactor) models with significant interaction can be analyzed
using simple effect comparisons. These F-tests are multiple comparisons which
are referred to as slice tests (e.g., in a two factor study one slices by
factor A by comparing the levels of factor B for each level of A). Slicing uses
the full model degrees of freedom and MSE. This paper shows how to use
Analysis of Means (ANOM) methods in SAS and from the multiple comparisons
platform in JMP11 to create ANOM decision charts for each of the slice values. These ANOM charts tell more about the relationship among the factor levels than the F tests.
Maximizing Confidence and Coverage for a Nonparametric Upper Tolerance Limit for a Fixed Number of Samples
A nonparametric upper tolerance limit (UTL) bounds a specified percentage of
the population distribution with specified confidence. The most common UTL is
based on the largest order statistic (the maximum) where the number of samples
required for a given confidence and coverage is easily derived for an
infinitely large population. This relationship can be used to determine the
number of samples prior to sampling to achieve a given confidence and coverage.
However, often statisticians are given a data set and asked to calculate a UTL
for the number of samples provided. Since the number of samples usually cannot
be increased to increase confidence or coverage for the UTL, the maximum
confidence and coverage for the given number of samples is desired. This paper
derives the maximum confidence and coverage for a fixed number of samples. This
relationship is demonstrated both graphically and in tabular form. The maximum
confidence and coverage are calculated for several sample sizes using results
from the maximization. This paper is for intermediate SAS® users of Base SAS
who understand statistical intervals.
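For the maximum-based UTL, the governing relationship is simple: with n samples from an infinite population, the confidence that the sample maximum bounds a fraction p (the coverage) of the distribution is 1 - p^n, and inverting it gives the largest coverage attainable at a fixed confidence. A short sketch of both directions (function names are invented for illustration):

```python
def utl_confidence(n, coverage):
    """Confidence that the largest of n samples bounds at least
    `coverage` of an infinite population: 1 - coverage**n."""
    return 1 - coverage ** n

def max_coverage(n, confidence):
    """Largest coverage attainable at the given confidence when the
    number of samples n is fixed: (1 - confidence)**(1/n)."""
    return (1 - confidence) ** (1 / n)
```

The familiar result that 59 samples suffice for a 95/95 UTL at the maximum falls directly out of the first function.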
Dealing with Missing Data for Credit Scoring
How well can statistical adjustments account for missing data in the
development of credit scores? We will demonstrate how credit scores are
developed and why missing data are a problem, especially in a waterfall decision
environment. We will then compare three approaches to the missing data. One
model will only use the complete cases. One model will use all of the data but
place missing data into its own bucket. The final model will demonstrate the
use of multiple imputation to account for the uncertainty created by missing
data.
Evaluating the Accuracy Assessment Methods of a Thematic Raster through SAS® Resampling Techniques and GTL Visualizations
The Cropland Data Layer (CDL) is a thematic raster layer of agricultural crops
and other categories derived from satellite imagery and other data layers and
trained on ground reference data. The CDL has been an annual product for the
48 contiguous states since 2008. The current accuracy assessment uses all data
points/pixels from the interior of the validation fields. This approach
introduces extensive spatial autocorrelation and ignores the 40 percent of the
data points that fall in field edges. Field boundary pixels have been ignored
in the past due to locational uncertainties. This project begins with the six
million data points available for assessment from the 2012 Michigan CDL and
applies resampling techniques from the SAS® SURVEYSELECT procedure to address
issues of varying field sizes, spatial autocorrelation and pixel location in
the edge versus the interior of the fields. The results are summarized in
customized Graphic Template Language (GTL) charts and alternatives to the
current assessment methodology are discussed.
Evaluating the Performance of the SAS® GLIMMIX Procedure for the Dichotomous Rasch model: A Simulation Study
Isaac Li, Yi-Hsin Chen and Jeffrey Kromrey
A simulation study was devised to evaluate the accuracy and precision of the
GLIMMIX procedure when fitting the dichotomous Rasch model. The evaluation
examined item parameter recovery, standard error estimates, and the
unstandardized and standardized fit indices produced by GLIMMIX. Factors manipulated in this study were test length (10, 20, 40, 60, and 80 items) and
sample size (100, 200, 400, 600, 800, 1000, 1500, and 2000 persons). The item
difficulties were generated from a normal distribution with the mean of 0 and
the standard deviation of 1. The generated item difficulties were symmetrically
distributed ranging from -3 to 3. The person ability parameters were also
generated from a normal distribution with the mean and standard deviation of 0
and 1, respectively. The following statistics were applied to evaluate the
performance of the GLIMMIX procedure: bias, sampling variance of the estimates,
average error variance, and descriptive statistics (mean, variance, minimum,
and maximum) for INFIT and OUTFIT and standardized INFIT and OUTFIT. The
results indicated that the SAS GLIMMIX procedure for the dichotomous Rasch
model provided biased estimates for smaller sample sizes and shorter tests. To utilize
this analytical tool, applying it to tests longer than 20 items and samples
greater than 200 persons is recommended.
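The model being fit here has a compact form: under the dichotomous Rasch model, the probability of a correct response depends only on the difference between person ability and item difficulty through a logistic link. A one-line sketch of that kernel (an illustration of the model, not the GLIMMIX syntax):

```python
import math

def rasch_prob(theta, b):
    """Dichotomous Rasch model: probability of a correct response for a
    person with ability theta on an item with difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))
```

In GLIMMIX terms, this corresponds to a binary logit model with item difficulties as fixed effects and person abilities as normally distributed random effects.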
GEN_OMEGA2: A SAS® Macro for Computing the Generalized Omega-Squared Effect Size Associated with Analysis of Variance Models
Anh P. Kellermann, Jeanine Romano, Patricia Rodríguez de Gil, Than Pham, Patrice Rasmussen, Yi-Hsin Chen and Jeffrey D. Kromrey
Researchers are strongly encouraged to report effect sizes in addition to
statistical significance, and effect sizes should be considered when evaluating
the results of a study. The
choice of an effect size for ANOVA models can be confusing because indices may
differ depending on the research design as well as the magnitude of the effect. Olejnik and Algina (2003) proposed the generalized eta-squared and
omega-squared effect sizes which are comparable across a wide variety of
research designs. This paper provides a SAS macro for computing the
generalized omega-squared effect size associated with analysis of variance
models by utilizing data from PROC GLM ODS tables. The paper provides the macro
programming language, as well as results from an executed example of the macro.
Area under a Receiver Operating Characteristic (ROC) Curve: comparing parametric estimation, Monte Carlo simulation and numerical integration
A receiver operating characteristic (ROC) curve is a plot of predictive model
probabilities of true positives (sensitivity) as a function of probabilities of
false positives (1 – specificity) for a set of possible cutoff points. Some
of the SAS/STAT procedures do not have built-in options for ROC curves and
there have been a few suggestions in previous SAS forums to address the issue
by using either parametric or non-parametric methods to construct those curves.
This study shows how a simple concave function satisfies the properties
necessary to provide good fits to the ROC curves of diverse predictive models,
and has also the advantage of giving an exact solution for the estimation of
the area under the ROC in terms of a Beta distribution – defined by the
parameters of the function. The study proceeds to discuss the implementation of
the approach to a working dataset, and compares the value of the area estimated
using the parametric solution to the ones obtained through Monte Carlo
simulation and three numerical methods of integration – Trapezoidal Rule,
Simpson’s Rule, and Gauss-Legendre Quadrature. The SAS products used in the
study are Base SAS, SAS/STAT, and SAS/GRAPH.
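Of the numerical methods compared, the Trapezoidal Rule is the most direct: the area under an empirical ROC curve is the sum of trapezoid areas between consecutive (false positive rate, true positive rate) points. A minimal sketch (illustrative only; the example points are invented):

```python
def trapezoidal_auc(points):
    """Area under an ROC curve given (FPR, TPR) points, by the
    trapezoidal rule. Points need not arrive sorted by FPR."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical three-point ROC curve.
auc = trapezoidal_auc([(0.0, 0.0), (0.5, 0.75), (1.0, 1.0)])
```

A parametric fit, by contrast, replaces this piecewise-linear approximation with a smooth curve whose area can be written in closed form, which is the advantage the paper explores.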
You like What? Creating a Recommender System with SAS
Recommendation engines provide the ability to make automatic predictions
(filtering) about the interests of a user by collecting preference information
from many users (collaborating). SAS provides a number of techniques and
algorithms for creating a recommendation system ranging from basic distance
measures to matrix factorization and collaborative filtering. The “wisdom of
crowds” suggests that communities make better decisions than a handful of
individuals, and that those decisions improve as the community grows. With
enough data on individual community participation, we can make predictions
about what an individual will like in the future based on what their likes and
dislikes have been in the past.
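The basic distance-measure approach mentioned above can be sketched in a few lines: score each neighbor's similarity to the target user over co-rated items, then predict the target's rating for an unseen item as a similarity-weighted average of the neighbors' ratings. A toy pure-Python version (an illustration of user-based collaborative filtering in general, not any specific SAS implementation; the ratings are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(target, others, item):
    """Similarity-weighted average of other users' ratings for `item`;
    users are dicts mapping item -> rating."""
    pairs = [(cosine(target, u), u[item]) for u in others if item in u]
    den = sum(abs(s) for s, _ in pairs)
    return sum(s * r for s, r in pairs) / den if den else None

target = {"a": 5, "b": 3}
others = [{"a": 5, "b": 3, "c": 4}, {"a": 1, "b": 5, "c": 2}]
pred = predict(target, others, "c")
```

Matrix factorization generalizes this by learning low-dimensional user and item vectors instead of computing similarities directly on raw ratings.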
An Intermediate Primer to Estimating Linear Multilevel Models using SAS® PROC MIXED
Bethany Bell, Whitney Smiley, Mihaela Ene, Phillip Sherlock, Jr. and Genine Blue
This paper expands upon Bell et al.’s (2013) SAS Global Forum presentation
“A Multilevel Model Primer Using SAS® PROC MIXED” in which the authors
presented an overview of estimating two and three-level linear models via PROC
MIXED. However, in their paper, the authors, for the most part, relied on
simple options available in PROC MIXED. In this paper, we present a more
advanced look at common PROC MIXED options used in the analysis of social and
behavioral science data, as well as the use of two different SAS macros
previously developed for use with PROC MIXED to examine model fit (MIXED_FIT;
Ene, Smiley, & Bell, 2012) and distributional assumptions (MIXED_DX; Bell et
al., 2010). Specific statistical options presented in the current paper
include (a) PROC MIXED statement options for estimating statistical
significance of variance estimates (COVTEST, including problems with using this
option) and estimation methods (METHOD =), (b) MODEL statement option for
degrees of freedom estimation (DDFM =), and (c) RANDOM statement option for
specifying the variance/covariance structure to be used (TYPE =). Given the
importance of examining model fit, of both fixed and random effects, we also
present methods for estimating changes in model fit, for both nested and
non-nested models, through a discussion and illustration of the SAS macro
MIXED_FIT. Likewise, the SAS macro MIXED_DX is used to show users how to
examine distributional assumptions associated with two-level linear models,
including normality and homogeneity of level-1 and level-2 residuals. To
maintain continuity with the Bell et al. (2013) introductory PROC MIXED paper,
and thus provide users with a set of comprehensive guides for estimating
multilevel models using PROC MIXED, we use the same real-world data sources,
including the publicly available Early Childhood Longitudinal
Study-Kindergarten cohort data as Bell et al. (2013) used.
Modeling Categorical Response Data
Logistic regression, generally used to model dichotomous response data, is one
of the basic tools for a statistician. But what do you do when maximum
likelihood estimation fails or your sample sizes are questionable? What happens
when you have more than two response levels? And how do you handle counts?
This tutorial briefly reviews logistic regression for dichotomous responses,
and then illustrates alternative strategies for the dichotomous case and
additional strategies such as the proportional odds model, generalized logit
model, conditional logistic regression, and Poisson regression. The
presentation is based on the third edition of the book Categorical Data
Analysis Using the SAS System by Stokes, Davis and Koch (2012). A working knowledge of logistic regression is required for this
tutorial to be fully beneficial.
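The proportional odds model mentioned above extends the dichotomous logit by modeling cumulative probabilities with a shared slope: P(Y <= j) = expit(a_j + beta * x) for ordered intercepts a_1 < ... < a_{J-1}, and the category probabilities are successive differences of the cumulative ones. A small sketch of that computation (an illustration of the model's form, not PROC LOGISTIC output; the parameter values are invented):

```python
import math

def expit(t):
    return 1 / (1 + math.exp(-t))

def cumulative_logit_probs(intercepts, beta, x):
    """Category probabilities for an ordinal response under a
    proportional-odds model with one predictor. The intercepts must be
    strictly increasing for all probabilities to be positive."""
    cum = [expit(a + beta * x) for a in intercepts] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical three-category response with two cut points.
probs = cumulative_logit_probs([-1.0, 1.0], beta=0.5, x=2.0)
```

The generalized logit model drops the shared-slope (proportional odds) restriction and fits a separate slope for each non-reference category instead.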