Friday, 2 October 2015

Some more examples...

C h a p t e r 16Summarizing Your Data

SAS has a number of procedures to help you report your data in the form you would like to see it presented. In deciding which procedure is most appropriate for your needs, you must consider a number of factors. These factors include the following.
 ● What is the purpose of the summary?
 ● What information do you need in the report?
 ● Do you need a specific layout or customized information?
 ● Do you need cross-tabulations or hierarchical groupings?
 ● Do you need statistics?

Solved Problems

*16-1

*Using the SAS data set College, compute the mean, median, minimum, and
maximum and the number of both missing and non-missing values for the variables
ClassRank and GPA. Report the statistics to two decimal places;

*Data set COLLEGE;

proc format ;
   value $yesno 'Y','1' = 'Yes'
                'N','0' = 'No'
                ' '     = 'Not Given';
   value $size 'S' = 'Small'
               'M' = 'Medium'
               'L' = 'Large'
                ' ' = 'Missing';
   value $gender 'F' = 'Female'
                 'M' = 'Male'
                 ' ' = 'Not Given';
run;

data college;
   length StudentID $ 5 Gender SchoolSize $ 1;
   do i = 1 to 100;
      StudentID = put(round(ranuni(123456)*10000),z5.);
      if ranuni(0) lt .4 then Gender = 'M';
      else Gender = 'F';
      if ranuni(0) lt .3 then SchoolSize = 'S';
      else if ranuni(0) lt .7 then SchoolSize = 'M';
      else SchoolSize = 'L';
      if ranuni(0) lt .2 then Scholarship = 'Y';
      else Scholarship = 'N';
      GPA = round(rannor(0)*.5 + 3.5,.01);
      if GPA gt 4 then GPA = 4;
      ClassRank = int(ranuni(0)*60 + 41);
      if ranuni(0) lt .1 then call missing(ClassRank);
      if ranuni(0) lt .05 then call missing(SchoolSize);
      if ranuni(0) lt .05 then call missing(GPA);
      output;
   end;
   format Gender $gender1.
          SchoolSize $size.
          Scholarship $yesno.;
   drop i;
run;



title "Statistics on the College Data Set";
proc means data= college        
n          
nmiss          
mean          
median          
min          
max          
maxdec=2;

var ClassRank GPA;
run;





*Q16-3 

Using the SAS data set College, report the mean and median GPA and ClassRank
broken down by school size (SchoolSize). Do this twice, once using a BY statement,
and once using a CLASS statement.;


proc sort data=college out=college1;  
by SchoolSize;
run;
title "Statistics on the College Data Set - Using BY";
title2 "Broken down by School Size";
proc means data=college1          
n          
mean          
median          
min          
max          
maxdec=2;  
by SchoolSize;  
var ClassRank GPA;
run;
title "Statistics on the College Data Set - Using CLASS";
title2 "Broken down by School Size";
proc means data=college          
n          
mean          
median          
min          
max          
maxdec=2;  
class SchoolSize;  
var ClassRank GPA;
run;




C h a p t e r 17
Counting Frequencies

PROC FREQ can use either raw data or cell count data to produce frequency and cross tabulation tables. Raw data, also known as case-record data, report the data as one record for each subject or sample member. Cell count data report the data as a table, listing all possible combinations of data values along with the frequency counts. 

Solved Problems

*Q17-3

Using the data set Blood, produce frequencies for the variable Chol (cholesterol).
Use a format to group the frequencies into three groups: low to 200 (normal), 201
and higher (high), and missing. Run PROC FREQ twice, once using the MISSING
option, and once without. Compare the percentages in both listings.;

*Data set BLOOD;
data blood;
   infile '/folders/myfolders/Practice/blood.txt' truncover;
   length Gender $ 6 BloodType $ 2 AgeGroup $ 5;
   input Subject
         Gender
         BloodType
         AgeGroup
         WBC
         RBC
         Chol;
   label Gender = "Gender"
         BloodType = "Blood Type"
         AgeGroup = "Age Group"
         Chol = "Cholesterol";
run;



proc format;  
value cholgrp low-200  = 'Normal'                
 201-high = 'High'                
 .        = 'Missing';
run;
title "Demonstrating the MISSING Option";
title2 "Without MISSING Option";
proc freq data=blood;  
tables Chol / nocum;  
format Chol cholgrp.;
run;
title "Demonstrating the MISSING Option";
title2 "With MISSING Option";
proc freq data=blood;  
tables Chol / nocum missing;  
format Chol cholgrp.;
run;




*Q17-5

Using the SAS data set College, create a two-way table of Scholarship (rows) by
ClassRank (columns). Use a user-defined format to group class rank into two groups:
70 and lower, and 71 and higher. (Please see the note in Chapter 16, Problem 2,
about the permanent formats used in this data set.);



proc format;  
value rank low-70  = 'Low to 70'            
71-high = '71 and higher';
run;
title "Scholarship by Class Rank";
proc freq data=college;  
tables Scholarship*ClassRank;  
format ClassRank rank.;
run;

Friday, 18 September 2015

First hand experience on SAS

So let’s start...this is my first post so feeling a bit excited.There are some questions which couldn't solve ,so kindly do suggest me a way out for those questions.


Conditional Formats and Labels


*Question1

Run the program here to create a temporary SAS data set called Voter:
data voter;
input Age Party : $1. (Ques1-Ques4)($1. + 1);
datalines;
23 D 1 1 2 2
45 R 5 5 4 1
67 D 2 4 3 3
39 R 4 4 4 4
19 D 2 1 2 1
75 D 3 3 2 3
57 R 4 3 4 4
;
Add formats for Age (0–30, 31–50, 51–70, 71+), Party (D = Democrat, R =
Republican), and Ques1–Ques4 (1=Strongly Disagree, 2=Disagree, 3=No
Opinion, 4=Agree, 5=Strongly Agree). In addition, label Ques1–Ques4 as
follows:
Ques1 The president is doing a good job
Ques2 Congress is doing a good job
Ques3 Taxes are too high
Ques4 Government should cut spending
;
Ans
proc format;
value umar 0-30 = 'less than 30'
30-50 = '30 to 50'
50-70 = '50 to 70'
70-high = 'above 70';
value $politics 'D' = 'Democrat'
'R' = 'Republican';
value $likert '1' = 'Strongly Disagree'
 '2'='Disagree'
 '3'='No Opinion' 
 '4'='Agree' 
 '5'='Strongly Agree';
 run;
data voter;
infile datalines;
input Age Party$ (Ques1-Ques4)($1.+1);
datalines;
23 D 1 1 2 2
45 R 5 5 4 1
67 D 2 4 3 3
39 R 4 4 4 4
19 D 2 1 2 1
75 D 3 3 2 3
57 R 4 3 4 4
;
run;
PROC PRINT noobs;
RUN;
proc freq;
format Age umar.
Party $politics.
Ques1-Ques4 $likert.;
label Ques1 = 'The president is doing a good job'
Ques2 = 'Congress is doing a good job'
Ques3 = 'Taxes are too high'
Ques4 = 'Government should cut spending';
run;



*Question2

You want to see frequencies for Questions 1 to 4 from the previous question.
However, you want only three categories: Generally Disagree (combine
Strongly Disagree and Disagree), No Opinion, and Generally Agree
(combine Agree and Strongly Agree). Accomplish this using a new format for
Ques1–Ques4.;
Ans
proc format;
value umar 0-<30 = 'less than 30'
30-<50 = '30 to 50'
50-<70 = '50 to 70'
70-high = 'above 70';
value $politics 'D' = 'Democrat'
'R' = 'Republican';
value $likert '1','2'='Generally Disagree'
 '3'='No Opinion'
 '4','5'='Generally Agree';
 run;
data voter;
infile datalines;
input Age Party$ (Ques1-Ques4)($1.+1);
datalines;
23 D 1 1 2 2
45 R 5 5 4 1
67 D 2 4 3 3
39 R 4 4 4 4
19 D 2 1 2 1
75 D 3 3 2 3
57 R 4 3 4 4
;
run;
PROC PRINT noobs;
RUN;
proc freq;
format Age umar.
Party $politics.
Ques1-Ques4 $likert.;
label Ques1 = 'The president is doing a good job'
Ques2 = 'Congress is doing a good job'
Ques3 = 'Taxes are too high'
Ques4 = 'Government should cut spending';
run;



*Question 3

Run the following program to create a SAS data set called Colors (see Chapter 21 for
a discussion of the double at signs [@@] in the INPUT statement):
data colors;
input Color : $1. @@;
datalines;
R R B G Y Y . . B G R B G Y P O O V V B
;
Ans
proc format;
value $col 'R','B','G' = 'Group 1'
'Y','O' = 'Group 2'
'.' = 'Not Given'
other = 'Group 3'
run;
data colors;
input Color : $1. @@;
datalines;
R R B G Y Y . . B G R B G Y P O O V V B
;run;
proc freq;
format Color $col.;
run;
 

Performing Conditional Processing

*Question1

Run the program here to create a temporary SAS data set called School:
data school;
input Age Quiz : $1. Midterm Final;
/* Add you statements here */
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
Using IF and ELSE IF statements, compute two new variables as follows: Grade
(numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.
The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75,
D = 70, and F = 65. Using this information, compute a course grade (Course) as a
weighted average of the Quiz (20%), Midterm (30%) and Final (50%).;
Ans
data school;
input Age Quiz : $1. Midterm Final;
if Age eq 12 then Grade= 6;
else if Age eq 13 then Grade = 8;
if Quiz eq 'A' then Course = ((0.2*95)+(0.3*Midterm)+(0.5*Final));
else if Quiz eq 'B' then Course = ((0.2*75)+(0.3*Midterm)+(0.5*Final));
else if Quiz eq 'C' then Course = ((0.2*70)+(0.3*Midterm)+(0.5*Final));
else if Quiz eq 'F' then Course = ((0.2*65)+(0.3*Midterm)+(0.5*Final));
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
run;
proc print noobs;
run;


*Question3

Using the Sales data set, list the observations for employee numbers (EmpID) 9888
and 0177. Do this two ways, one using OR operators and the other using the IN
operator.;
Ans
proc import datafile= '/folders/myfolders/Practice/Sales.xls' dbms=xls out=Sales Replace;
run;
proc print;
run;
proc print data=Sales;
where EmpID eq '9888' or EmpID eq '0177';
run;

proc print data=Sales;
where EmpID in ('9888' '0177');
run;


*Question4

Using the Sales data set, create a new, temporary SAS data set containing Region
and TotalSales plus a new variable called Weight with values of 1.5 for the North
Region, 1.7 for the South Region, and 2.0 for the West and East Regions. Use a
SELECT statement to do this;
Ans
data NewFile;
set Sales;
if Region eq 'North' then Weight = 1.5;
else if Region eq 'South' then Weight = 1.7;
else Weight = 2.0;
run;
proc print noobs;
var Region TotalSales Weight;
run;


*Question5

Starting with the Blood data set, create a new, temporary SAS data set containing
all the variables in Blood plus a new variable called CholGroup. Define this new
variable as follows:
􀂃 CholGroup Chol
􀂃 Low Low – 110
􀂃 Medium 111 – 140
􀂃 High 141 – High
Use a SELECT statement to do this.;
Ans.
data BloodData;
infile '/folders/myfolders/Practice/blood.txt';
input ID$ Gender$ Bloodgroup$ Age$ WBC RBC Chol;
if Chol le 110 and not missing(Chol) then CholGroup='Low-110';
else if Chol ge 111 and chol le 140 then CholGroup='111-140';
else if Chol ge 141 then CholGroup='141-high';
run;
proc print noobs;
run;



*Question6

Using the Sales data set, list all the observations where Region is North and
Quantity is less than 60. Include in this list any observations where the customer
name (Customer) is Pet's are Us.;
Ans
proc import datafile= '/folders/myfolders/Practice/Sales.xls' dbms=xls out=Sales Replace;
run;
proc print data=Sales noobs;
where Customer eq "Pet's are Us" OR (Region eq 'North' and Quantity lt 60) ;
run;


Performing Iterative Processing: Looping



*Questin10;

You are testing three speed-reading methods (A, B, and C) by randomly assigning
10 subjects to each of the three methods. You are given the results as three lines of
reading speeds, each line representing the results from each of the three methods,
respectively. Here are the results:
250 255 256 300 244 268 301 322 256 333
267 275 256 320 250 340 345 290 280 300
350 350 340 290 377 401 380 310 299 399
Create a temporary SAS data set from these three lines of data. Each observation
should contain Method (A, B, or C), and Score. There should be 30 observations in
this data set. Use a DO loop to create the Method variable and remember to use a
single trailing @ in your INPUT statement. Provide a listing of this data set using
Ans
PROC PRINT;
data speed_test;
do Method_variable='A','B','C';
do subj=1 to 10;
input Scores @@;
output;
end;end;
datalines;
250 255 256 300 244 268 301 322 256 333
267 275 256 320 250 340 345 290 280 300
350 350 340 290 377 401 380 310 299 399
;
run;
proc print noobs;
run;





Working with Dates

*Data set HOSP;
data hosp;
   do j = 1 to 1000;
      AdmitDate = int(ranuni(1234)*1200 + 15500);
      quarter = intck('qtr','01jan2002'd,AdmitDate);
      do i = 1 to quarter;
         if ranuni(0) lt .1 and weekday(AdmitDate) eq 1 then
            AdmitDate = AdmitDate + 1;
         if ranuni(0) lt .1 and weekday(AdmitDate) eq 7 then
            AdmitDate = AdmitDate - int(3*ranuni(0) + 1);
         DOB = int(25000*Ranuni(0) + '01jan1920'd);
         DischrDate = AdmitDate + abs(10*rannor(0) + 1);
         Subject + 1;
         output;
      end;
   end;
   drop i j;
   format AdmitDate DOB DischrDate mmddyy10.;
run;
proc print data=hosp (obs=10);
run;


*Question4

Using the Hosp data set, compute the subject’s ages two ways: as of January 1, 2006
(Call it AgeJan1), and as of today’s date (call it AgeToday). The variable DOB
represents the date of birth. Take the integer portion of both ages. List the first 10
observations.;
Ans
data AgeJan1;
set hosp;
Age=yrdif(DOB,'01Jan2006'd,'Actual');
run;
title "Listing of AGES1";
proc print data=AgeJan1 (obs=10);
format Age 5.1;
run;

data AgeToday;
set hosp;
Age=yrdif(DOB,Today(),'Actual');
run;
title "Listing of AGES1";
proc print data=AgeToday (obs=10);
format Age 5.1;
run;