1. For the data set pepsicokes posted on BlackBoard, do the following R operations:
a. Bring the file into R;
b. Create a dummy variable whitecollar ( 1 if white collar, 0 otherwise), representing
stores with white collar percent > 60 and calculate frequencies for this variable; the white collar percentages on the dataset are represented by the variableWhiteCollarPctPen.
c. Do a crosstab of Prizm cluster (the 16 social groups) with Chain; and another crosstab of Prizm cluster with whitecollar (use table() function); the social groups are represented on the dataset by the variableprizm_cluster.
d. Calculate descriptive statistics for any 5 continuous variables in the dataset. (you should already know from the previous assignment that what function you must use!)
e. Regress Corp_Pep_Volume_per__MM_ACV (million dollars All Commodity Volume = Total Store Sales) on a set of any 6 independent variables. (use lm() for linear regression)
f. Run a Hierarchical Cluster Analysis using the variables Pepsi Volume per MM ACV, Pepsi Price per MM ACV, Coke Volume per MM ACV, Coke Price per MM ACV. Generate a Scree plot and find the kink point to determine the optimal number of clusters.
g. Run a Kmeans Cluster Analysis using the number of clusters determined in d.
2. Consider the regression below (and on the next page) that was estimated on weekly data over a 2year period on a sample of Kroger stores for Pepsi carbonated soft drinks. The dependent variable is the log of Pepsi volume per MM ACV. There are 53 stores in the dataset (data were missing for some stores in some weeks). Please answer the following questions about the regression output.
a. Comment on the goodness of fit and significance of the regression, and of individual variables. What does the ANOVA table reveal?
b. Write out the equation and interpret the meaning of each of the parameters.
c. What is the price elasticity? The crossprice elasticity with respect to Coke price? Are these results reasonable? Explain
d. What do the results tell you about the effectiveness of Pepsi and Coke display and advertising?
e. What are the 3 most important variables? Explain how you arrived at this conclusion.
f. What is collinearity? Is collinearity a problem for this regression? Explain. If it is a problem, what action would you take to deal with it?
g. What changes to this regression equation, if any, would you recommend? Explain.
Model Summary(b)
Model 
R 
R Square 
Adjusted R Square 
Std. Error of the Estimate 
1 
.869(a) 
.754 
.754 
.4120 
a Predictors: (Constant), Mass stores in trade area, Labor Day dummy, Pepsi advertising days, Store traffic, Memorial Day dummy, Pepsi display days, Coke advertising days, Log of Pepsi price, Coke display days, Log of Coke price
b Dependent Variable: Log of Pepsi volume/MM ACV
ANOVA(b)
Model 
Sum of Squares 
df 
Mean Square 
F 
Sig. 

1  Regression 
2881.089 
10 
288.109 
1697.262 
.000(a) 
Residual 
937.695 
5524 
.170 

Total 
3818.784 
5534 
a Predictors: (Constant), Mass stores in trade area, Labor Day dummy, Pepsi advertising days, Store traffic, Memorial Day dummy, Pepsi display days, Coke advertising days, Log of Pepsi price, Coke display days, Log of Coke price
b Dependent Variable: Log of Pepsi volume/MM ACV
Coefficients(a)
Model 
Unstandardized Coefficients 
Standardized Coefficients 
t 
Sig. 
Collinearity Statistics 

B 
Std. Error 
Beta 
Tolerance 
VIF 

1  (Constant) 
7.79429 
.06249 
124.721 
.000 

Log of Pepsi price 
3.34665 
.03483 
.739 
96.091 
.000 
.751 
1.332 

Log of Coke price 
.65877 
.03170 
.181 
20.784 
.000 
.587 
1.703 

Pepsi advertising days 
.00173 
.00020 
.065 
8.644 
.000 
.784 
1.275 

Coke advertising days 
.00009 
.00018 
.004 
.502 
.616 
.689 
1.450 

Pepsi display days 
.00011 
.00021 
.004 
.546 
.585 
.656 
1.525 

Coke display days 
.00299 
.00020 
.123 
14.766 
.000 
.646 
1.549 

Labor Day dummy 
.27190 
.04167 
.045 
6.525 
.000 
.923 
1.083 

Memorial Day dummy 
.21295 
.04269 
.036 
4.988 
.000 
.834 
1.199 

Store traffic 
.00000 
.00000 
.023 
3.367 
.001 
.961 
1.040 

Mass stores in trade area 
.00910 
.00026 
.238 
35.161 
.000 
.968 
1.033 
a Dependent Variable: Log of Pepsi volume/MM ACV
3. The tables on the following 4 pages summarize the output of KMeans clustering for the WNY soft drink file that I used to demonstrate regression.
Since clusters should reflect consumer and firm behavior for the focal product category, I used the following bases for clustering: Pepsi Volume per MM ACV, Pepsi Price per MM ACV, Coke Volume per MM ACV, Coke Price per MM ACV. These variables were all standardized to mean 0, standard deviation 1 (Zscores) before clustering.
A 4cluster solution, which I think provides a good description of this market, is provided below. Output consists of averages of standard scores (Zscores) for each cluster, ANOVA tests of significant differences between these averages across clusters, and number of stores in each cluster.
Descriptor variables for each cluster were taken to be: Currentpop, MedianHHincome,
MedianYrsSchool, WhiteCollarPctPen, Farm_Forest_FishPctPen, BlueCollarPctPen,
MedianHomeValue, WhitePopPctPen, BlackPopPctPen, Groc_Miles,
Mass_Miles. Descriptives and ANOVA tests are presented for the continuous variables. Crosstabs are presented for chain and prizm social group.
a. Interpret the results of the clustering. Name each cluster, and describe what it stands for. Explain your choices.
b. What can you say from the results about the market for soft drinks, and about the apparent strategies of Tops and Wegmans, and Pepsi and Coke?
c. Develop a strategy for targeting each of the four segments for Pepsi.
KMeans Clustering of WNY Data
Bases
Final Cluster Centers
Corp_Pepsi_  Corp_Coke_  
Volume_per__  Corp_Pepsi_  Volume_per__  Corp_Coke_  
Cluster  MM_ACV  Price  MM_ACV  Price 
1 
0.9092 
1.1475 
0.8399 
0.8676 
2 
0.2813 
0.5680 
1.2003 
0.6806 
3 
0.1705 
0.0227 
0.1792 
1.2173 
4 
1.5359 
1.1635 
0.6072 
0.1488 
Means are for standardized variables – scale = mean 0, stdev = 1 
ANOVA
Cluster 
Error 
F 
Sig. 

Variable 
Mean Square 
df 
Mean Square 
df 
RSquare  
Corp_Pep_Volume_per__MM_ACV 
29.3050 
3 
0.3567 
132 
87.92 
0.0000 
0.6512 
Corp_Pep_Price 
30.4930 
3 
0.3297 
132 
91.48 
0.0000 
0.6776 
Corp_Coke_Volume_per__MM_ACV 
27.7033 
3 
0.3931 
132 
83.11 
0.0000 
0.6156 
Corp_Coke_Price 
36.1290 
3 
0.2016 
132 
108.39 
0.0000 
0.8029 
The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.
Cluster Summary and Number of Cases in each Cluster
Cluster  Frequency in Cluster 
RMS Std Deviation 
Maximum Distance from Seed to Observation  Nearest Cluster  Distance Between Cluster Centroids  
1 
39 
0.4396 
1.7172 
3 
2.7020 

2 
32 
0.4881 
2.6067 
3 
2.2685 

3 
43 
0.5985 
2.8373 
2 
2.2685 

4 
22 
0.7716 
2.2536 
2 
2.4256 
Descriptors
Descriptives
Variable and Cluster 
N 
Mean 
Std Dev 
Currentpop 

1 
39 
44707.74 
17134.62 
2 
32 
39931.44 
15379.06 
3 
43 
54964.19 
23737.98 
4 
22 
14805.00 
14126.24 
Overall 
136 
41989.54 
22814.25 
MedianHHincome 

1 
39 
48707.69 
14214.63 
2 
32 
44103.84 
13361.84 
3 
43 
43920.47 
12139.50 
4 
22 
36287.27 
4185.07 
Overall 
136 
44101.64 
12748.94 
MedianYrsSchool 

1 
39 
13.48 
0.82 
2 
32 
13.15 
0.63 
3 
43 
13.23 
0.74 
4 
22 
12.74 
0.23 
Overall 
136 
13.20 
0.72 
WhiteCollarPctPen 

1 
39 
61.64 
9.29 
2 
32 
57.73 
9.61 
3 
43 
57.25 
9.15 
4 
22 
48.23 
5.09 
Overall 
136 
57.16 
9.72 
Farm_Forest_FishPctPen 

1 
39 
1.36 
1.26 
2 
32 
1.42 
1.74 
3 
43 
1.93 
1.71 
4 
22 
4.15 
2.95 
Overall 
136 
2.00 
2.09 
BlueCollarPctPen 

1 
39 
24.25 
6.78 
2 
32 
26.32 
6.31 
3 
43 
26.59 
6.53 
4 
22 
29.42 
5.01 
Overall 
136 
26.31 
6.49 
Variable and Cluster 
N  Mean  Std Dev 
MedianHomeValue 

1 
39 
104816.67 
28068.36 
2 
32 
92660.03 
26336.36 
3 
43 
96370.60 
29053.05 
4 
22 
85279.36 
15407.20 
Overall 
136 
96125.39 
26915.04 
WhitePopPctPen 

1 
39 
83.84 
16.49 
2 
32 
86.13 
17.39 
3 
43 
85.58 
18.46 
4 
22 
95.20 
3.95 
Overall 
136 
86.77 
16.40 
BlackPopPctPen 

1 
39 
9.56 
12.73 
2 
32 
9.38 
16.09 
3 
43 
8.67 
13.90 
4 
22 
1.88 
2.20 
Overall 
136 
7.99 
13.16 
Groc_Miles 

1 
39 
2.90 
0.69 
2 
32 
3.20 
1.60 
3 
43 
2.61 
1.60 
4 
22 
5.54 
4.27 
Overall 
136 
3.31 
2.32 
Mass_Miles 

1 
39 
2.49 
0.90 
2 
32 
2.52 
1.01 
3 
43 
2.30 
1.67 
4 
22 
4.87 
4.81 
Overall 
136 
2.82 
2.40 
ANOVA Tests of Equality of Group Means
Variable  DF Cluster  DF Error 
F 
PROB 
Currentpop 
3 
132 
22.7096 
0.00000 
MedianHHincome 
3 
132 
4.8345 
0.00317 
MedianYrsSchool 
3 
132 
5.6417 
0.00114 
WhiteCollarPctPen 
3 
132 
10.9795 
0.00000 
Farm_Forest_FishPctPen 
3 
132 
12.2661 
0.00000 
BlueCollarPctPen 
3 
132 
3.1651 
0.02670 
MedianHomeValue 
3 
132 
2.8355 
0.04066 
WhitePopPctPen 
3 
132 
2.5265 
0.06025 
BlackPopPctPen 
3 
132 
1.9613 
0.12291 
Groc_Miles 
3 
132 
10.2798 
0.00000 
Mass_Miles 
3 
132 
7.3739 
0.00013 
Chain * Cluster Number of Case Crosstabulation
Frequency 
Cluster 

Row Pct 
1 
2 
3 
4 
Total 
TOPS 
0 
31 
32 
22 
85 
0 
36.47 
37.65 
25.88 

WEGMANS 
39 
1 
11 
0 
51 
76.47 
1.96 
21.57 
0 

Total 
39 
32 
43 
22 
136 
ChiSquare Tests for Chain * Cluster Number of Case Crosstabulation
Statistic DF Value Prob
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ChiSquare 3 96.9395
Likelihood Ratio ChiSquare 3 122.1429
MantelHaenszel ChiSquare 1 58.5519
Prizm_Cluster Crosstabulation Cluster Number of Case *
Cluster Number of Cases 

Cluster 
1 
2 
3 
4 
C1 
2 
0 
0 
1 
C2 
3 
4 
6 
0 
C3 
1 
4 
5 
0 
S2 
7 
1 
6 
0 
S3 
11 
7 
7 
0 
S4 
2 
4 
2 
0 
T1 
2 
4 
1 
0 
T2 
4 
2 
3 
1 
T3 
0 
2 
3 
13 
T4 
4 
0 
5 
7 
U1 
0 
1 
0 
0 
U2 
1 
0 
0 
0 
U3 
2 
3 
5 
0 