6 The Analyze Phase

朗朗xl 2017-02-25


Objectives

The objectives of the ANALYZE phase are to:

Arrive at the root cause by process analysis or data analysis

Quantify the opportunity for the project

The true reason why a problem could exist in the process is unearthed in the Analyze phase.

The goal of analysis can be defined by the equation:

Solving for Y = f(X₁, X₂, …, Xₙ)

The goal in the analysis phase is to determine which factors (Xs) in the process are the largest contributors to the performance of Y (output).

A. Exploratory Data Analysis

Data analysis can be divided into two phases: the exploratory phase and the confirmatory phase. Before actually studying the problem and establishing a cause and effect theory, one must thoroughly examine the data for patterns, trends, or gaps. This is called exploratory data analysis.

Exploratory Data Analysis (EDA) is an approach for data analysis that utilizes a variety of techniques (mostly graphical) to:

1. maximize insight into a data set
2. reveal underlying structure
3. extract important variables
4. detect outliers and irregularities
5. test underlying assumptions
6. develop economical (parsimonious) models
7. determine optimal factor settings

Four themes recur throughout EDA:

1. Resistance

Resistance refers to the insensitivity of a method to a small change in the data. If a small portion of the data is tainted, the method should not produce significantly different conclusions.

2. Residuals

Residuals are what remain after removing the effect of a model. For example, one might subtract the mean from each value, or look at deviations about a regression line.

3. Re-expression

Re-expression involves examining the data on different scales (for example, a logarithmic or square-root scale) to find the scale on which their structure is clearest.

4. Visual Display

Visual display helps the analyst examine the data graphically and point out regularities and abnormalities in them.
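The first three themes can be illustrated with a few lines of Python. This is a minimal sketch on made-up numbers, not data from any real process; the variable names and the choice of the median as the "model" are assumptions made only for illustration.

```python
import math
from statistics import mean, median

# Hypothetical measurements; the last value of `tainted` is a keying error.
clean = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1]
tainted = clean[:-1] + [101.0]

# Resistance: one tainted value moves the mean a lot but leaves the median alone.
mean_shift = abs(mean(tainted) - mean(clean))
median_shift = abs(median(tainted) - median(clean))
print(f"mean shift = {mean_shift:.2f}, median shift = {median_shift:.2f}")

# Residuals: what remains after removing a simple model (here, the median level).
level = median(tainted)
residuals = [round(v - level, 1) for v in tainted]
print("residuals:", residuals)

# Re-expression: on a log scale, widely spread data become evenly spaced.
spread = [1, 10, 100, 1000, 10000]
log_spread = [math.log10(v) for v in spread]
print("raw gap between mean and median:", mean(spread) - median(spread))
print("log gap between mean and median:", mean(log_spread) - median(log_spread))
```

The tainted value stands out clearly in the residuals, which is the kind of irregularity EDA is meant to expose.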

There are many EDA methods and techniques, but two of them are used frequently in Six Sigma: stem-and-leaf plots and box plots. Both are simple enough to be drawn by hand.
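Both displays can also be produced from a few lines of code. The sketch below builds a text stem-and-leaf plot and the five-number summary on which a box plot is based. The data are hypothetical, and the quartiles use the "inclusive" method of Python's statistics module, which is only one of several quartile conventions.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical two-digit measurements.
data = [23, 25, 31, 34, 36, 38, 41, 42, 47, 52, 58]

def stem_and_leaf(values):
    """Group each value by its tens digit (stem) and list its units digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))
print("five-number summary:", five_number)
```

The box of a box plot spans the lower to upper quartile with a line at the median, and the whiskers extend toward the minimum and maximum.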

1. Multi-Variate Studies

Multi-Variate studies use visualization to identify the relationships between key process input and output variables. They match data visualization techniques with the types of data to which each technique is best suited, and they relate the families of variation shown by Multi-Variate charts to concrete examples.

Multi-Variate Charts

A multivariate chart is a control chart for variables data (see Chapter 8: Black Belt, Control for information on control charts). Multivariate charts are used to detect shifts in the mean or in the association (covariance) between several linked parameters.

Several charts are available for multivariate analysis:

1. The T² control chart, based on Hotelling's T² statistic, is used to detect shifts in the process. This statistic is calculated for the process's Principal Components, which are linear combinations of the Process Variables. The Principal Components (PCs) are independent of one another, even though the Process Variables may be correlated with one another. Independence of the components is necessary for the analysis. The PCs may be used to estimate the data and thereby provide a basis for an estimate of the prediction error. The number of PCs can never exceed the number of Process Variables and is often constrained to be fewer.

2. The Squared Prediction Error (SPE) chart may also be used to detect shifts. The SPE is based on the error between the raw data and a Principal Component model fitted to that data.

3. Contribution Charts are used to determine the Process Variables' contributions to either the Principal Component (Score Contributions) or the SPE (Error Contributions) for a given sample. This is particularly effective for determining the Process Variable that is responsible for process shifts. These process variables are restricted to subgroups of size one.

4. Loading Charts offer an indication of the relative contribution of each Process Variable towards a given Principal Component for all groups in the analysis.
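To make the T² statistic concrete, the following sketch computes it for two correlated variables. It is a simplification and not the full procedure described above: the statistic is computed directly on the raw process variables using their sample covariance rather than on Principal Components, no control limit is derived, and the data and function names are hypothetical.

```python
from statistics import mean

# Hypothetical in-control reference samples: (temperature, pressure), subgroup size one.
reference = [
    (50.1, 30.3), (49.8, 29.8), (50.3, 30.2), (49.9, 30.1),
    (50.0, 29.9), (50.2, 30.4), (49.7, 29.9), (50.0, 30.0),
]

def center_and_covariance(points):
    """Return the mean vector and the sample covariance terms (sxx, sxy, syy)."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def hotelling_t2(point, center, cov):
    """T^2 = d' S^-1 d for two variables, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

center, cov = center_and_covariance(reference)

# Both new samples lie inside the range of each variable taken separately,
# but the second one breaks the positive correlation between the variables.
on_trend = hotelling_t2((50.2, 30.3), center, cov)
off_trend = hotelling_t2((50.2, 29.8), center, cov)
print(f"on-trend T^2 = {on_trend:.2f}, off-trend T^2 = {off_trend:.2f}")
```

The second sample would pass two separate single-variable checks, yet its T² value is much larger, which is the kind of joint shift a multivariate chart is meant to catch.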

Uses of Multi-Variate charts

A Multivariate Analysis (MVA) may be valuable in SPC whenever there is more than one process variable. It becomes more useful when the effects of multiple parameters are not independent or when some parameters are correlated. Sometimes the true source of variation may not be measurable.

An important point is that almost all processes are multivariate, but multivariate analysis is frequently not required because there are only a few independent controlled variables. However, whether or not the variables are dependent, using a separate control chart for each variable increases the probability of randomly finding a variable ‘out of control’: the more variables there are, the more likely it is that at least one of those charts will show an ‘out of control’ condition even when the process has not shifted. Thus, the probability of making a wrong decision (a Type 1 error) increases if each variable is controlled separately. Geometrically, the control region for two separately acting variables is a rectangle, whereas the control region for two jointly acting parameters is an ellipse.
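The growth in false alarms is easy to quantify under one simplifying assumption: if the charts are statistically independent and each has a false-alarm probability of alpha per sample, the chance that at least one of k charts signals is 1 - (1 - alpha)^k. The sketch below evaluates this; the value alpha = 0.0027 (the usual three-sigma figure) and the function name are illustrative choices.

```python
def false_alarm_probability(alpha, k):
    """Probability that at least one of k independent charts signals falsely."""
    return 1 - (1 - alpha) ** k

alpha = 0.0027  # per-chart Type 1 error assumed for three-sigma limits
for k in (1, 2, 5, 10, 20):
    print(f"{k:2d} charts: overall false-alarm probability = "
          f"{false_alarm_probability(alpha, k):.4f}")
```

With ten separate charts the overall rate is already close to ten times the single-chart rate.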

2. Measuring and Modeling Relationships between Variables

a. Simple Least-Squares Linear Regression

Regression analysis is very important in Six Sigma. It helps the analyst study the cause and effect of a problem, and it can be used at every stage of the problem-solving and planning process.

Regression is the analysis of data aimed at discovering how one or more variables (called independent or predictor variables) affect other variables (called dependent or response variables). It describes the nature of the relationship between two or more variables.

For example: (1) You may be interested in studying the relationship between blood pressure and age, or between the height and weight of a person. Here only two variables are used, so this is an example of Simple Linear Regression.

(2) The response of an experimental animal to some drug may depend on the size of the dose and the age and weight of the animal. Here more than two variables are used, so it is a case of Multiple Regression.

The Regression Model

It is an application of the linear model in which the populations of the response (dependent) variable are identified with numeric values of one or more quantitative independent variables. The purpose of statistical analysis of a regression model is not to make inferences about differences among the means of those populations, but rather to make inferences about the relationship of the mean of the response variable to the independent variables. These inferences are made through the parameters of the model.

For Example:

1. Estimating weight gain by the addition of different amounts of various dietary supplements to a man’s diet.

2. Estimating the amount of sales associated with levels of expenditure for various types of advertising.

Regression Line

To measure the amount of change that normally takes place in variable Y for a unit change in X, a line has to be fitted to the points plotted in the scatter diagram. This line is called the regression line, and fitting it is called linear regression.

The regression line tells about the average relationship between two variables for the whole series. It is also called the line of average relationship.

Simple Linear Regression Equation

The standard form of equation describing a line is

Y= α + β X

When this equation describes the line marking the path of the points in a scatter diagram, it is called regression equation. The line it describes is called the line of regression of Y on X.

The values of α and β in the equation are termed constants, i.e. these values are fixed. The first constant, α, indicates the value of Y when X = 0; it is also called the Y-intercept.

The value β indicates the slope of the regression line and it gives us a measure of change in Y for a unit change in X. It is also called regression coefficient of Y on X. If you know the values of α and β, you can easily compute the value of Y for a given value of X.

The values of α and β are calculated with the help of the following two normal equations, where N is the number of pairs of values:

ΣY = Nα + βΣX

ΣXY = αΣX + βΣX²
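Solving the two normal equations for the line Y = α + βX (ΣY = Nα + βΣX and ΣXY = αΣX + βΣX²) can be sketched as follows. The data set and the function names are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for alpha (intercept) and beta (slope)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    alpha = (sum_y - beta * sum_x) / n
    return alpha, beta

def predict(alpha, beta, x):
    """Compute Y for a given X once alpha and beta are known."""
    return alpha + beta * x

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

alpha, beta = fit_line(xs, ys)
print(f"Y = {alpha:.2f} + {beta:.2f} X")
print(f"estimated Y at X = 6: {predict(alpha, beta, 6):.2f}")
```

Substituting the fitted values back into both normal equations reproduces ΣY and ΣXY, which is a quick check on the arithmetic.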

Standard Error

If the scatter of points about the regression line is smaller than the scatter of the observed values of Y about their mean, it can be inferred that the regression equation is likely to be useful in estimating Y. The scatter of points about the regression line is measured by the standard error of estimate of Y.

It is given by the formula:

where S_Y = standard error of estimate

Y = observed value of Y

Y_C = estimated value of Y

N = number of pairs of values
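A numerical sketch of this comparison is given below. It assumes the standard error of estimate takes the common form sqrt(Σ(Y - Y_C)² / N); this divisor is an assumption, since some texts divide by N - 2 instead, so the function exposes the choice as a parameter. The data and the fitted line Y_C = 2.2 + 0.6 X are hypothetical.

```python
import math

# Hypothetical data and a line assumed to have been fitted to it already.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
alpha, beta = 2.2, 0.6

def standard_error_of_estimate(xs, ys, alpha, beta, divisor_offset=0):
    """sqrt(sum((Y - Y_C)^2) / (N - divisor_offset)); offset 0 or 2 by convention."""
    n = len(xs)
    sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - divisor_offset))

# Scatter of the observed Y values about their own mean, with the same divisor N.
y_mean = sum(ys) / len(ys)
scatter_about_mean = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / len(ys))

s_y = standard_error_of_estimate(xs, ys, alpha, beta)
s_y_unbiased = standard_error_of_estimate(xs, ys, alpha, beta, divisor_offset=2)
print(f"scatter about the line: {s_y:.3f}")
print(f"scatter about the mean: {scatter_about_mean:.3f}")
```

Here the scatter about the line is smaller than the scatter about the mean, so the regression equation adds information when estimating Y.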

Regression Model:

In the simple linear regression model two variables, X and Y are taken. The following are the assumptions underlying the simple linear regression model:

Assumptions

a. The values of the independent variable X are fixed by the investigator, i.e. X is referred to as a non-random variable.

b. The variable X is measured without error, i.e. the magnitude of the measurement error in X is negligible.

c. For each value of X, there is a subpopulation of Y values. For the usual inferential procedures of estimation and hypothesis testing to be valid, these subpopulations must be normally distributed.

d. The variances of subpopulations of Y are equal and the means of the subpopulations of Y lie on the same straight line.

e. The values of Y are statistically independent i.e. the values of Y chosen at one value of X in no way depend on the values of Y chosen at another value of X.

These assumptions may be summarized by means of the following equation, which is called the regression model:

y = α + βx + e

where y is a typical value from one of the subpopulations of Y, α and β are called the population regression coefficients, and e is the error term.

The error term e shows the amount by which y deviates from the mean of the subpopulation of Y values from which it is drawn. The e's for each subpopulation are also normally distributed, with a variance equal to the common variance of the subpopulations of Y values.
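The model y = α + βx + e can be simulated to see what the assumptions imply: at each fixed x, the y values form a normal subpopulation centered on the line, with the same variance everywhere. In this sketch the parameter values, the seed, and the function name are all hypothetical.

```python
import random
from statistics import mean

ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0  # hypothetical population parameters

def draw_subpopulation(x, size, rng):
    """Draw y = ALPHA + BETA * x + e, with independent errors e ~ Normal(0, SIGMA)."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for _ in range(size)]

rng = random.Random(42)  # fixed seed so the run is repeatable
fixed_x = [1, 2, 3, 4]   # X values fixed by the investigator

sample_means = {x: mean(draw_subpopulation(x, 2000, rng)) for x in fixed_x}
for x, m in sample_means.items():
    print(f"x = {x}: sample mean of y = {m:.3f}, line gives {ALPHA + BETA * x:.3f}")
```

The sample means of the simulated subpopulations fall close to the straight line α + βx, as assumption (d) requires.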

Scatter Diagram

A useful first step in studying the relationship between two variables is to prepare a scatter diagram of the data. The points are plotted by assigning values of the independent variable X to the horizontal axis and values of the dependent variable Y to the vertical axis. The pattern made by the points plotted on the scatter diagram usually suggests the basic nature and strength of the relationship between the two variables. For example, the points may suggest that the relationship can be described by a straight line crossing the Y-axis below the origin and making approximately a 45-degree angle with the X-axis.

It may look as if it would be simple to draw, freehand, the line through the data points that describes the relationship between X and Y. In fact, it is not likely that any freehand line drawn through the data will be the line that best describes the relationship, since freehand lines reflect any defects of vision or judgment of the person drawing the line.

Usage of Scatter Diagram: Scatter diagrams are used to study cause and effect relationships in Six Sigma. The underlying assumption is that the independent variables cause a change in the response variables. A scatter diagram helps answer questions like, “In the production process, is the output of machine A better than the output of machine B?”

The Least-Squares Line

The method usually employed for obtaining the desired line is known as the method of least squares, and the resulting line is called the least-squares line. The least-squares line does not necessarily pass through the observed points that are plotted on the scatter diagram.

A line drawn through the points is best in the least-squares sense if:

The sum of the squared vertical deviations of the observed data points (yᵢ) from that line is smaller than the sum of the squared vertical deviations of the data points from any other line.
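This defining property can be checked numerically. The sketch below fits the least-squares line to a small hypothetical data set and compares its sum of squared vertical deviations with that of a few other candidate lines, including a plausible freehand guess. The candidate lines and function names are assumptions made for illustration.

```python
# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def least_squares(xs, ys):
    """Slope and intercept from the deviations about the means of X and Y."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

def sse(alpha, beta):
    """Sum of squared vertical deviations of the points from Y = alpha + beta * X."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

alpha, beta = least_squares(xs, ys)
best = sse(alpha, beta)

# Other lines as (intercept, slope); the first is a freehand-style guess.
candidates = [(1.0, 1.0), (2.3, 0.6), (2.2, 0.65), (3.0, 0.4)]
print(f"least-squares line: SSE = {best:.3f}")
for a, b in candidates:
    print(f"Y = {a} + {b} X: SSE = {sse(a, b):.3f}")
```

Every alternative line gives a larger sum of squared deviations than the least-squares line, however small the change in slope or intercept.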