Data Analytics for Management (Subject) / Final Revision / (Lesson)

There are 34 cards in this lesson


This lesson was created by Janaw55.


  • SST, SSM, SSR: SST = Data - Mean = Σ(Yi - Ȳ)². SSM = Model - Mean = Σ(Ŷi - Ȳ)². SSR = Model - Data = Σe² = Σ(Yi - Ŷi)². SST is the total variability between the actual data and the mean, and SST = SSM + SSR. SSM > SSR: the model explains more variability than it leaves unexplained, so its predictions are better than using the mean.
  • SLR Equations: R Square = SSM / SST, or equivalently 1 - SSR / SST. dof = number of data points - number of explanatory variables - 1. Standard Error of Estimate = √(SSR / dof).
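
The two cards above can be checked numerically. The following is a minimal Python sketch on a small made-up data set (the numbers are an assumption for illustration, not part of the lesson): it fits a simple linear regression by least squares and computes SST, SSM, SSR, R Square and the standard error of estimate as defined above.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope (b1) and intercept (b0) for simple linear regression.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                  # model predictions
residuals = [yi - fi for yi, fi in zip(y, y_hat)]   # data minus model

sst = sum((yi - y_bar) ** 2 for yi in y)      # data vs. mean
ssm = sum((fi - y_bar) ** 2 for fi in y_hat)  # model vs. mean
ssr = sum(e ** 2 for e in residuals)          # model vs. data

r_square = ssm / sst
dof = n - 1 - 1                               # data points - variables - 1
se_estimate = math.sqrt(ssr / dof)

print(f"SST={sst:.2f} SSM={ssm:.2f} SSR={ssr:.2f}")
print(f"R Square={r_square:.2f} SE of estimate={se_estimate:.3f}")
```

On this data the decomposition SST = SSM + SSR holds, both formulas for R Square agree, and the residuals sum to zero.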
  • Hypothesis Testing: Create confidence intervals (or find Z-values) for your hypothesized value, not for your sample mean!
  • Residual Analysis: The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual: Residual = Observed value - Predicted value. Both the sum and the mean of the residuals are equal to zero: Σe = 0 and ē = 0. Standardized residual (for count data, as in a chi-square test) = (observed count - expected count) / √(expected count); it measures the strength of the difference between observed and expected values. Residual plot: a graph that shows the residuals on the Y axis and the predictor variable on the X axis. The residuals should be randomly distributed around the horizontal line.
  • Standard Error of Estimate: Measures the typical magnitude of the residuals. It is a good indication of how useful the regression line is for predicting Y from X, i.e. the level of accuracy of predictions made from the equation: the smaller it is, the more accurate the predictions tend to be. It estimates the magnitude of the residuals when the explanatory variables are taken together; if the explanatory variables are ignored, the standard deviation of the observed Y gives the magnitude of the error. It can increase as more "bad" variables are added (as Adj. R Square decreases). If the SE of the multiple regression is less than the SE of the simple regression, the prediction of Y is better when both explanatory variables are taken together. Regression SE = √(SSR / dof_res).
  • Multicollinearity: Strong correlations between two or more predictors in a regression model. No X should be an exact linear combination of other Xs (redundancy in the data). It causes problems in estimating the coefficients. In Stata's vif output (Variance Inflation Factor), a value > 10 indicates multicollinearity.
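
As a rough illustration of the VIF rule of thumb, here is a Python sketch for the special case of exactly two predictors. It relies on the standard definition VIF = 1 / (1 - R²_j), where R²_j comes from regressing one predictor on the others; with two predictors this R² is simply their squared correlation. The helper names and the data are hypothetical, and this is not Stata's implementation.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2_j); with two predictors R^2_j is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors: x2 is almost a copy of x1; x3 and x4 are uncorrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9]
x3 = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
x4 = [1.0, -2.0, 1.0, 1.0, -2.0, 1.0]

vif_collinear = vif_two_predictors(x1, x2)      # far above 10
vif_independent = vif_two_predictors(x3, x4)    # equal to 1

print(f"near-collinear VIF: {vif_collinear:.1f}")
print(f"uncorrelated VIF:   {vif_independent:.1f}")
```

With more than two predictors, R²_j has to come from a full regression of predictor j on all the others, which this sketch does not cover.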
  • Interaction Variables: Use them when there is reason to believe that the effect of one X depends on the value of another X. They make the model more realistic by allowing the regression lines to have different slopes. Without an interaction variable, you are forcing the lines to be parallel.
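
To see why an interaction term frees the slopes, consider a sketch with one numeric predictor x and one 0/1 dummy d. The model form y = b0 + b1·x + b2·d + b3·(x·d) is the usual textbook setup; the coefficient values below are hypothetical and not estimated from any data in this lesson.

```python
def predict(x, d, b0, b1, b2, b3):
    """Fitted value for y = b0 + b1*x + b2*d + b3*(x*d), where d is a 0/1 dummy."""
    return b0 + b1 * x + b2 * d + b3 * x * d

def slope(d, b1, b3):
    """Slope of y with respect to x for group d: b1 when d = 0, b1 + b3 when d = 1."""
    return b1 + b3 * d

# Hypothetical coefficients, for illustration only.
B0, B1, B2, B3 = 10.0, 2.0, 3.0, 0.5

slope_group0 = slope(0, B1, B3)
slope_group1 = slope(1, B1, B3)

# Without the interaction term (b3 = 0) both groups share one slope: parallel lines.
parallel_group0 = slope(0, B1, 0.0)
parallel_group1 = slope(1, B1, 0.0)

print(slope_group0, slope_group1)        # slopes differ by b3
print(parallel_group0, parallel_group1)  # identical slopes
```

The dummy alone (b2) only shifts the intercept; only b3 lets the two regression lines diverge.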
  • 5 Basic Assumptions of Regression Model: (1) There is a population regression line. It joins the means of Y for all values of the Xs. The mean of the errors is 0 for any fixed values of the Xs. (2) All values of the Xs are assumed fixed. The only randomness in the values of Y comes from the error term ε. (3) For any values of the Xs, the variance of Y is constant. (4) The errors ε are normally distributed with mean 0, i.e. ε ~ N(0, σ²). (5) The errors are uncorrelated in successive observations.
  • Confidence Interval and Hypothesis Test for a Regression Coefficient: Let b be the OLS estimate of β, the true unknown value of the slope, and let sb be the estimated standard deviation of b. Confidence interval for a regression coefficient: b ± t-multiple × sb, with the t-multiple for 0.025 in each tail and dof = n - k - 1. Hypothesis test for a regression coefficient (H0: β = 0): t-value = b / sb, compared with the t-critical value for 0.025 and dof = n - k - 1.
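
A worked sketch of both calculations for a simple regression (k = 1) on made-up data. Two things here are assumptions not stated on the card: the formula sb = se / √Sxx, which is the standard result for simple linear regression, and the t-multiple 3.182, which is the usual table value for 0.025 in the upper tail with 3 dof.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, k = len(x), 1

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                # OLS slope estimate
a = y_bar - b * x_bar        # OLS intercept estimate

ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
dof = n - k - 1
se = math.sqrt(ssr / dof)    # standard error of estimate
sb = se / math.sqrt(sxx)     # standard error of the slope (simple regression only)

T_MULTIPLE = 3.182           # t table value for 0.025, dof = 3 (looked up, not computed)

ci_low = b - T_MULTIPLE * sb
ci_high = b + T_MULTIPLE * sb

t_value = b / sb             # test statistic for H0: beta = 0
significant = abs(t_value) > T_MULTIPLE

print(f"b={b:.3f} sb={sb:.3f} 95% CI=({ci_low:.3f}, {ci_high:.3f})")
print(f"t-value={t_value:.3f} significant at 5%: {significant}")
```

The two views agree: the interval contains 0 exactly when |t-value| is below the t-critical value, so here the slope is not significant at the 5% level.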
  • Error vs. Residual: Error: the vertical distance from a point to the population regression line. The error for any point, labeled ε, is the difference between Y and μY. It cannot be calculated, because the population line is unknown. Residual: the vertical distance from a point to the estimated regression line. Residuals can be calculated from the observed data.
  • Autocorrelation: Residuals are often correlated with nearby residuals, e.g. in time-series data or in cross-sectional data whose observations are ordered in some particular way. Lag 1 autocorrelation: residuals separated by one time period are correlated (usually positively). The Durbin-Watson test reports a test statistic with a value from 0 to 4, where 2 means no autocorrelation, 0 to <2 means positive autocorrelation (common in time-series data), and >2 to 4 means negative autocorrelation (less common in time-series data). Rule of thumb: when the number of explanatory variables is fairly small, a DW statistic < 1.2 warrants attention.
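
The card gives the interpretation of the Durbin-Watson statistic but not its formula. A minimal sketch, assuming the standard definition DW = Σ(e_t - e_{t-1})² / Σe_t² and two made-up residual series:

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals in time order.
runs = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]         # neighbours look alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours flip sign

dw_runs = durbin_watson(runs)                # below 2: positive autocorrelation
dw_alternating = durbin_watson(alternating)  # above 2: negative autocorrelation

print(f"runs: {dw_runs:.2f}, alternating: {dw_alternating:.2f}")
```

Long runs of same-signed residuals keep successive differences small and pull the statistic toward 0; sign flips make the differences large and push it toward 4.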
  • Dummy Variable: The coefficient of the dummy X implies that the predicted Y for a male would be less than the predicted Y for a female, provided that all other explanatory variables remain constant. If it is not significant: one extra level of X does not improve the prediction of Y; it does not matter whether you have X1 or X2. If you took out X2 (not significant), R Square would go down, but the Adjusted R Square could go up.
  • Comment on ANOVA Test: The obtained test statistic (F-ratio) is 20.84 and the corresponding p-value is almost equal to 0, meaning that the result is significant at the 5% level. Hence, we have enough evidence to reject the null hypothesis and conclude that at least one of the coefficients is different from zero (not due to chance).
  • Not To Do: Do not claim to be able to “hold everything else constant” for a single individual (it often just doesn’t make any sense). Do not infer causality: with observational data, without deliberately assigned treatments, randomization, and control, we can’t draw conclusions about causes and effects. Do not extrapolate into the future or beyond the data; do not estimate values for x’s outside the data range. Do not treat the sign of a coefficient as special: it also depends on the other predictors in the model, so do not infer a direct relationship from it. Do not interpret an insignificant coefficient: you can’t be sure that the value of the corresponding parameter in the underlying regression model isn’t really zero. Make sure the errors are nearly Normal: all of our inferences require that the true errors be modeled well by a Normal model, so check the histogram and Normal probability plot of the residuals to see whether this assumption looks reasonable. Watch out for high-influence points and outliers: we always have to be on the lookout for a few points that have undue influence on our model, and regression is certainly no exception.
  • Cause and Effect (Hume, 1748): (1) Cause and effect must occur close together in time (contiguity). (2) The cause must occur before the effect does. (3) The effect should never occur without the presence of the cause.
  • Confounding Variables (The Tertium Quid): A variable, which we may or may not have measured, other than our predictor variables that potentially affects an outcome variable. Ruling out confounds (Mill, 1865): an effect should be present when the cause is present, and when the cause is absent the effect should be absent also. Control condition: the cause is absent. Treatment condition: the proposed cause is present.
  • Sampling Distribution of a Regression Coefficient: Let β be any of the βs, and let b be the least squares estimate of β. If the regression assumptions are valid, the standardized value t = (b − β) / sb has a t distribution with n − k − 1 dof. The estimate b is unbiased in the sense that its mean is β, the true but unknown value of the slope. If bs were estimated from repeated samples, some would underestimate β and others would overestimate β, but on average they would be on target. The estimated standard deviation of b is labeled sb. It is usually called the standard error of a regression coefficient. This standard error is related to the standard error of estimate se, but it is not the same.
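
The "repeated samples" idea can be made concrete with a small simulation. This is only a sketch: the population values (intercept 5, slope 2, error SD 1), the fixed X values and the number of repetitions are all hypothetical choices, and the comparison value σ / √Sxx for the spread of b is the standard result for simple regression rather than something stated on the card.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

BETA0, BETA1, SIGMA = 5.0, 2.0, 1.0    # hypothetical "true" population values
x = [float(i) for i in range(1, 11)]   # fixed X values, as the assumptions require
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

def sample_slope():
    """Draw one new sample of Y around the population line and return the OLS slope."""
    y = [BETA0 + BETA1 * xi + random.gauss(0.0, SIGMA) for xi in x]
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

slopes = [sample_slope() for _ in range(2000)]

mean_b = statistics.mean(slopes)   # should sit close to BETA1 (unbiasedness)
sd_b = statistics.stdev(slopes)    # empirical spread of b across samples
share_below = sum(b < BETA1 for b in slopes) / len(slopes)

print(f"mean of b: {mean_b:.3f} (true slope {BETA1})")
print(f"sd of b:   {sd_b:.3f} (theory {SIGMA / sxx ** 0.5:.3f})")
print(f"share of samples underestimating the slope: {share_below:.2f}")
```

Individual slopes miss the true value in both directions, yet their average lands near it, which is what "unbiased" means here.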
  • OLS: In a simple regression model, the least-squares estimators minimize the sum of squared residuals around the estimated regression line. In a multiple regression model, the least-squares estimators minimize the sum of squared residuals around the estimated regression plane.
  • Confidence Interval for a Single Random Manager's Income. Do we have to make any assumptions? For the income of a single random manager, we have to assume that income is symmetric, like the normal distribution. This is what enables us to use the Z-multiple of the normal distribution (1.96 × SD) to get a 95% CI for his income. However, this does not exactly hold true, as income could have a different shape (e.g. right-skewed). Random manager: CI = Point Estimate ± 1.96 × STANDARD DEVIATION.
  • Correlation Coefficient: The correlation coefficient indicates the degree of linear relationship between two variables. Example interpretation: when X goes up, Y tends to go down, and vice versa; however, the correlation is only very weak. The slope does not affect the correlation. It is a scaled version of the covariance, and equals the average product of the standardized variables. It is sensitive to outliers.
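
Three of the claims on this card can be verified directly: correlation as scaled covariance, correlation as the average product of standardized variables, and the fact that rescaling Y (which changes the slope) leaves the correlation unchanged. The sketch below uses made-up data and sample formulas with an n - 1 divisor; the function names are hypothetical.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance scaled by both standard deviations."""
    return covariance(x, y) / (sample_sd(x) * sample_sd(y))

def avg_product_of_z(x, y):
    """Average (n - 1 divisor) of the products of the standardized values."""
    mx, my, sx, sy = mean(x), mean(y), sample_sd(x), sample_sd(y)
    products = [((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
r_from_z = avg_product_of_z(x, y)

# Rescaling Y multiplies the regression slope by 10 but leaves r unchanged.
y_rescaled = [10.0 * yi + 3.0 for yi in y]
r_rescaled = correlation(x, y_rescaled)

print(f"r={r:.4f} r_from_z={r_from_z:.4f} r_rescaled={r_rescaled:.4f}")
```

All three values coincide, which is why the correlation is unit-free while the covariance and the slope are not.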
  • Multicollinearity in Data: X1 and X2 represent ... Logically, they are highly correlated with each other. This is confirmed by the correlation matrix. It also shows that the correlation between X1 and Y is much higher than that between X2 and Y. If we have both variables in our equation, X1 already explains most of the variation in Y that X2 could possibly explain. After taking out the least significant and least correlated variable, the other variable becomes significant.
  • Types of Data: Cross-sectional data: cross-sectional data is a random sample. Each observation is a new individual, firm, etc., with information at a point in time. If the data is not a random sample, we have a sample-selection problem. Time series data: time series data has a separate observation for each time period, e.g. stock prices. Since it is not a random sample, there are different problems to consider; trends and seasonality will be important.
  • Reasoning for Choice of Model: We use this model as it is the best among the ones we have. Specifically: high R Square, all coefficients are significant, and no multicollinearity problems.
  • Reason For Multiple Regression: Take other relevant factors into account. Tighter confidence intervals. Smaller errors. See the significant impact of single variables on the “target” variable (hypothesis testing).
  • Treatment of Outliers: The decision whether to include or exclude an outlier remains with the researcher. (1) Why is it an outlier? Was the respondent deliberately giving a wrong answer, did he or she not understand the question, or is it a typing error? (2) The outlier is real and simply different. Either way, the researcher must justify deleting data to the reader of a technical report. A large outlier can strongly influence the results and should be ruled out. Depending upon where the outlier falls, the correlation coefficient may be increased or decreased. The smaller the sample size, the greater the effect of the outlier. Best: compute the correlation coefficient with and without the outlier.
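
The "with and without" comparison takes only a few lines. A sketch on made-up data, where the last observation is a deliberately extreme, hypothetical outlier:

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data; the last point (6, -5) plays the role of the outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 1.9, 3.2, 3.8, 5.1, -5.0]

r_with = correlation(x, y)
r_without = correlation(x[:-1], y[:-1])

print(f"with outlier:    r = {r_with:.3f}")
print(f"without outlier: r = {r_without:.3f}")
```

With only six observations, a single point is enough to turn a strong positive correlation into a negative one, which illustrates why small samples are especially vulnerable.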
  • Treatment of small t-values: X contains no additional information that is not already contained in the other variables. The variable is redundant.
  • Principle of Parsimony: Favor a model with fewer Xs if it explains Y almost as well as a model with additional Xs. Explain the most with the least.
  • Adjusted R Square: Indicates whether the effect of added explanatory variables is worthwhile. If it is considerably lower than R Square, the effect of the added Xs is not significant. While R² assumes that each of the K explanatory variables explains variation in Y, Adjusted R² assumes that only the variables that actually affect Y explain variation in Y. It penalizes adding more variables to the equation that do not fit the model.
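
The penalty can be illustrated with the standard formula Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), which the card does not state. In this sketch the sample size and the R² values are hypothetical: a fourth variable is added that raises R² only marginally.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r_square) * (n - 1) / (n - k - 1)

N = 30  # hypothetical sample size

# Model A: 3 explanatory variables. Model B: the same plus one "bad" variable
# that lifts R Square only from 0.600 to 0.601.
adj_a = adjusted_r_square(0.600, N, 3)
adj_b = adjusted_r_square(0.601, N, 4)

print(f"Model A: R2=0.600 adjusted={adj_a:.4f}")
print(f"Model B: R2=0.601 adjusted={adj_b:.4f}")
```

R Square can only stay the same or rise when a variable is added, but the adjusted version falls here because the tiny gain does not make up for the lost degree of freedom.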
  • R Square: R Square is the square of the correlation between the observed data and the fitted values. In simple regression: if the correlation between X and Y is 0.8, R Square will be 0.64. If the correlation drops to 0.7, R Square drops to 0.49 (49%).
  • Randomization in Controlled Experiments: X is randomly assigned (e.g. patients are randomly assigned a treatment). All other characteristics end up in the residuals. The residuals are distributed independently. Does this hold true in real experiments?
  • Linearity assumption: Points should be evenly scattered around the least squares line. If violated -> Polynomials -> Logistic
  • Uniform Variance: Residuals are distributed completely randomly around the horizontal line. If violated -> Logistic Transformation -> Weighted Least Squares. If wrong, the standard errors of the estimates are wrong, and so are the CIs.
  • Normality assumption: Points are not too distant from the diagonal line of the QQ-plot or PP-plot, and the histogram is somewhat bell-shaped. If met, the ratios from hypothesis tests follow a t-distribution. If violated -> Logistic Transformation
  • Omitted Variable Bias: There are omitted confounding factors that bias the OLS estimator. Do not interpret the coefficient as a causal effect.
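
A small simulation can show how large the bias gets. Everything in this sketch is a hypothetical setup: Y truly depends on X1 (effect 1.0) and on a confounder X2 (effect 2.0), and X2 moves with X1 (0.8 per unit). Regressing Y on X1 alone then attributes part of X2's effect to X1, so the slope drifts toward 1.0 + 2.0 × 0.8 = 2.6.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

TRUE_EFFECT_X1 = 1.0   # hypothetical causal effect of x1 on y
TRUE_EFFECT_X2 = 2.0   # hypothetical effect of the omitted confounder
LINK = 0.8             # how strongly the confounder moves with x1

n = 5000
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [LINK * a + random.gauss(0.0, 0.6) for a in x1]
y = [
    TRUE_EFFECT_X1 * a + TRUE_EFFECT_X2 * b + random.gauss(0.0, 1.0)
    for a, b in zip(x1, x2)
]

b_short = ols_slope(x1, y)   # regression that omits x2
expected_biased = TRUE_EFFECT_X1 + TRUE_EFFECT_X2 * LINK

print(f"slope with x2 omitted: {b_short:.2f}")
print(f"true effect of x1:     {TRUE_EFFECT_X1:.2f}")
print(f"slope predicted by the bias formula: {expected_biased:.2f}")
```

The estimated slope describes how Y moves with X1 in the data, but it is not the causal effect of X1 as long as the confounder is left out.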