Advanced Settings

 zdloy 2010-08-19

The advanced settings give you more options to customize your analysis. To change them, click the Advanced Settings button.

 

[Image: advanced_settings_20.gif]

 

Resolving Duplicate Identifiers

If a dataset contains duplicate identifiers for the same molecule, the analysis uses, by default, the identifier with the highest Expression Value (or the lowest, when the Expression Value is a p-value). When there are multiple Expression Value types, you can determine which type should be used to resolve duplicates. In the absence of Expression Values, the first instance of the molecule is used in the analysis.

 

You can change how the application should resolve duplicate identifiers in the Advanced Settings window. Select the Expression Value type and the preferred value (maximum, average, median or minimum) to be used in the analysis if the dataset contains duplicate identifiers for the same molecule. For example, if you choose Fold Change and Maximum for this parameter, then the identifier with the highest fold change value (absolute value) will be used for the analysis. For Ratio and Fold Change expression values, Minimum and Maximum refer to the lowest and highest magnitudes, respectively (absolute values), and the application uses Maximum as the default setting. For p-values, the application defaults to the Minimum setting.

 

The following logic is applied when mapping identifiers and resolving duplicate identifiers within IPA:

1) IPA maps the identifiers. Unmapped genes are not used for the analysis.

2) Absent flag is applied. Any identifiers marked as absent are not used for the analysis.

3) Override flag is applied.

4) Duplicate identifiers are found.

a) If any duplicate identifiers are marked as override, remove duplicate probes for the identifier that are NOT marked as override.

b) If more than one duplicate identifier is marked as override, use the remaining steps of the resolving duplicate process to determine the proper expression value.

c) Resolve other duplicate identifiers not marked as override using the resolving duplicate process.

5) User-specified cutoff values are applied (ignores Override molecules)

6) User-specified Filter criteria are applied (may filter out Override molecules)

7) The results of this are then used for Enriched Datasets and Analyses.  
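The order of these steps can be illustrated with a minimal Python sketch. It is not IPA's implementation: the `Probe` record, the `run_pipeline` function, and the `resolve` and `passes_cutoff` callbacks are hypothetical names introduced here, and step 6 (user filters) is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Probe:
    """Hypothetical record for one uploaded identifier."""
    identifier: str
    gene: Optional[str]      # None means the identifier did not map
    value: float
    absent: bool = False
    override: bool = False


def run_pipeline(
    probes: List[Probe],
    resolve: Callable[[List[float]], float],
    passes_cutoff: Callable[[float], bool],
) -> Dict[str, float]:
    # Steps 1-2: unmapped and absent identifiers are not used.
    kept = [p for p in probes if p.gene is not None and not p.absent]

    # Step 4: collect identifiers that map to the same gene.
    groups: Dict[str, List[Probe]] = defaultdict(list)
    for p in kept:
        groups[p.gene].append(p)

    result: Dict[str, float] = {}
    for gene, group in groups.items():
        overridden = [p for p in group if p.override]
        # Step 4a: if any duplicate is marked override, drop the others.
        candidates = overridden or group
        # Steps 4b-4c: resolve whatever duplicates remain.
        value = resolve([p.value for p in candidates])
        # Step 5: the cutoff is applied after resolving and ignores overrides.
        if overridden or passes_cutoff(value):
            result[gene] = value
    return result


probes = [
    Probe("A", "G1", -10), Probe("B", "G1", 3),
    Probe("C", None, 50),                          # unmapped
    Probe("D", "G2", 1.5, override=True), Probe("E", "G2", 8),
    Probe("F", "G3", 1.2), Probe("G", "G3", 9, absent=True),
]
result = run_pipeline(
    probes,
    resolve=lambda values: max(values, key=abs),   # "Maximum" setting
    passes_cutoff=lambda v: abs(v) >= 2.0,
)
print(result)
```

In this made-up dataset, G2 is kept with the value of its override identifier even though 1.5 is below the cutoff, while G3 is dropped because its only non-absent identifier fails the cutoff.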

 

IPA uses the following definitions to resolve duplicates:

1. Maximum: for all expression value types, take the absolute value for the duplicates and the largest value (furthest away from zero) will be considered as the "maximum."

 

2. Minimum: for all expression value types, take the absolute value for the duplicates and the smallest value (closest to zero) will be considered as the "minimum."

 

3. Median:

a. For p-value, FDR, or Intensity: if there is an even number of Ids, average the two middle values. If there is an odd number of Ids, use the middle value.

b. For Log Ratio, Ratio and Other, take the values as is (do not take the absolute values) and then apply the rules from 3a.

c. For Fold Change: the values are sorted. If there is an odd number of identifiers, the middle value is used. If there is an even number, the two middle values are taken; any of them less than zero is converted to its negative inverse, the natural logarithms of the two values are averaged, and the exponential of that average is taken. If the result is less than 1, it is converted to its negative inverse.

 

4. Average:

a. For p-value, FDR, and Intensity, sum all of the values and divide by the number of values.

b. For Log Ratio, Ratio and Other, take the values as is (do not take the absolute values) and then apply the rules from 4a.

c. For Fold-Change, values less than zero are converted to their negative inverses, the natural logs (ln) of all values are averaged, and the exponential of that average is taken. If the result is less than 1, it is converted to its negative inverse.
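The four definitions can be sketched in Python as follows. This is an illustration of the rules as written, not IPA's code; all function names and the `fold_change` flag are assumptions introduced for the example.

```python
import math
from statistics import mean, median


def fold_change_to_ratio(fc: float) -> float:
    """Negative fold changes become their negative inverse (-5 -> 0.2)."""
    return -1.0 / fc if fc < 0 else fc


def ratio_to_fold_change(ratio: float) -> float:
    """Values below 1 become their negative inverse (0.2 -> -5)."""
    return -1.0 / ratio if ratio < 1 else ratio


def resolve_maximum(values):
    # Rule 1: the value furthest from zero.
    return max(values, key=abs)


def resolve_minimum(values):
    # Rule 2: the value closest to zero.
    return min(values, key=abs)


def resolve_median(values, fold_change=False):
    if not fold_change:
        # Rules 3a and 3b: ordinary median, values taken as is.
        return median(values)
    # Rule 3c: fold changes.
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    middle = [fold_change_to_ratio(v) for v in ordered[n // 2 - 1:n // 2 + 1]]
    log_mean = mean([math.log(r) for r in middle])
    return ratio_to_fold_change(math.exp(log_mean))


def resolve_average(values, fold_change=False):
    if not fold_change:
        # Rules 4a and 4b: arithmetic mean, values taken as is.
        return mean(values)
    # Rule 4c: average on the natural-log scale, then convert back.
    logs = [math.log(fold_change_to_ratio(v)) for v in values]
    return ratio_to_fold_change(math.exp(mean(logs)))


fold_changes = [-10, -5, 3, 6]
print(resolve_maximum(fold_changes))                              # -10
print(resolve_minimum(fold_changes))                              # 3
print(round(resolve_median(fold_changes, fold_change=True), 2))   # -1.29
print(round(resolve_average(fold_changes, fold_change=True), 2))  # -1.29
```

Averaging on the log scale treats a two-fold increase and a two-fold decrease symmetrically, which a plain arithmetic mean of signed fold changes would not.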

 

Notes:

1) Fold-Change values must not have values between -1 and +1. For an explanation of this, click here.

 

2) When Resolve Duplicates is set to Average or Median a new expression value is calculated (modified from the original) and saved. The original value still exists but is not shown.  The newly calculated expression value will be associated with the first identifier corresponding to the duplicate molecule from the dataset.

 

3) When you view this calculated value in Mapped IDs, Network Eligible or Function Eligible IDs, the Molecules tab, or Overlay Analyzed Dataset, the new expression value has a * at the end, indicating that it was modified as a result of resolving duplicates. In such cases the molecule name also has a * at the end.

 

4) Resolving duplicate identifiers may cause a mapped ID to be excluded from your analysis. This is because IPA first resolves the duplicates and then applies the cutoff value and/or filters. If this molecule is especially important for your analysis, you may try marking it as Override to ensure its inclusion in the analysis. You may also consider marking the duplicate identifiers that are causing the identifier to be pulled out of the analysis as "Absent" so that the analysis can continue without them. Alternatively, you can manually remove duplicates from your data upload file based upon your own criteria (quality of a probeset, possible outlier, etc.).

 

5) In the case that no expression values are included in the uploaded dataset, IPA uses the first instance of the duplicate identifier.

 

Examples

Example 1:

Id A: -10

Id B:  -5

Id C:  3

Id D:  6

 

Ids A - D map to the same gene.

Assume all IDs map in IPA. Expression values are fold-changes.

Assume a cutoff value of 2.0 in the analysis settings.

 

Resolve Duplicates using Maximum

Id A, with a fold-change of -10 would be used in the rest of the analysis because its absolute value is greater than those from Ids B, C, and D.

 

The Fold-change value of -10 meets the cutoff of 2.0, so this identifier would be included in the analysis.

 

Resolve Duplicates using Minimum

Id C, with a fold-change of 3 would be used in the rest of the analysis because its absolute value is smaller than those from Ids A, B, and D.

 

The fold-change value of 3 meets the cutoff of 2.0, so this gene would be included in the analysis.

 

Resolve Duplicates using Median

1. Since the expression values are fold-change, first sort them.

-10, -5, 3, 6

 

2. Take the negative inverse of the values less than 0:

0.1, 0.2, 3, 6

 

3. Since there is an even number of IDs, take the natural log (ln) of the 2 middle numbers:

ln (0.2)= -1.6

ln (3) = 1.1

 

4. Average these 2 values (1.1 + -1.6)/2 = -0.25

 

5. Take the exponent = 0.77

 

6. Since the value is less than 1, take the negative inverse: -1.29

 

7. This new expression value will be associated with Id A from the dataset, since this identifier came first.

 

8. Since the cutoff is 2.0 for the analysis, the median fold-change of -1.29 does not meet the cutoff, and this gene would be excluded from the analysis.

 

Resolve Duplicates using Average

1. Since the expression values are fold-change, calculate the negative inverse of values <0.

0.1, 0.2, 3, 6

 

2. Take the natural log of all values

ln (0.1) = -2.3

ln (0.2) = -1.6

ln (3) = 1.1

ln (6) = 1.8

 

3. Calculate the average: (-2.3 + -1.6 + 1.1 + 1.8)/4 = -0.25 (-0.255 before rounding)

 

4. Take the exponent = 0.77

 

5. Since the value is less than 1, take the negative inverse: -1.29

 

6. This new expression value will be associated with Id A from the dataset, since this identifier came first.

 

7. The average fold-change of -1.29 does not meet the cutoff of 2.0, so this gene would be excluded from the analysis.
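As a cross-check on the Average calculation for Example 1: averaging natural logs and then exponentiating is the same as taking the geometric mean of the converted values. The following Python sketch (variable names are illustrative, and the cutoff is assumed to be compared against the magnitude of the fold change) evaluates the rule without intermediate rounding.

```python
import math

fold_changes = [-10, -5, 3, 6]
cutoff = 2.0

# Convert negative fold changes to their negative inverse: 0.1, 0.2, 3, 6.
ratios = [-1.0 / v if v < 0 else v for v in fold_changes]

# Average on the natural-log scale, then exponentiate.
log_avg = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Equivalent form: the n-th root of the product (geometric mean).
geo = math.prod(ratios) ** (1.0 / len(ratios))

# A result below 1 is converted back to a negative fold change.
fold_change = -1.0 / geo if geo < 1 else geo
meets_cutoff = abs(fold_change) >= cutoff

print(round(log_avg, 4), round(geo, 4))   # both about 0.7746
print(round(fold_change, 2))              # about -1.29
print(meets_cutoff)                       # False
```

For these four values the average happens to equal the median, because the product of the outer pair (0.1 x 6) equals the product of the middle pair (0.2 x 3).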

 

 

Example 2:

Id A: 0.7

Id B: 2.0

Id C: 1.6

 

Ids A - C map to the same gene.

Assume all IDs map in IPA. Expression values are ratios.

Assume a fold-change cutoff of 2.0 for the analysis settings.

 

Resolve Duplicates using Maximum

Id B, with a ratio of 2 would be used in the rest of the analysis because its value is greater than those from Ids A and C.

 

Since the ratio of 2 meets the cutoff of 2, this gene would be included in the analysis.

 

Resolve Duplicates using Minimum

Id A, with a ratio of 0.7 would be used in the rest of the analysis because its absolute value is smaller than those from Ids B and C.

 

The ratio of 0.7 corresponds to a fold-change of -1.4. This expression value does not meet the cutoff of 2.0, so the gene would be excluded from the analysis.
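The ratio to fold-change conversion used here can be sketched in Python. The function name is hypothetical, and the sketch assumes that magnitude is judged on the fold-change scale; taking the absolute value of the raw ratios selects the same identifier in this example.

```python
def ratio_to_fold_change(ratio: float) -> float:
    """Ratios below 1 map to negative fold changes (0.5 -> -2.0)."""
    if ratio <= 0:
        raise ValueError("a ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio


ratios = {"A": 0.7, "B": 2.0, "C": 1.6}
cutoff = 2.0

fold_changes = {name: ratio_to_fold_change(r) for name, r in ratios.items()}

# "Minimum" picks the identifier whose magnitude is closest to zero.
chosen = min(fold_changes, key=lambda name: abs(fold_changes[name]))
meets_cutoff = abs(fold_changes[chosen]) >= cutoff

print(chosen, round(fold_changes[chosen], 2))   # A -1.43
print(meets_cutoff)                             # False
```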

 

Resolve Duplicates using Median

1. Since the expression values are ratios, put the values in order from smallest to largest:

0.7, 1.6, 2.0

 

2. Since there are an odd number of IDs, take the middle number = 1.6

 

3. This new expression value will be associated with Id A from the dataset, since this identifier came first.

 

4. The median ratio of 1.6 does not meet the cutoff of 2.0 in the analysis settings, so the gene would be excluded from the analysis.

 

Resolve Duplicates using Average

1. Since the expression values are ratios, just calculate the average of all values:

(0.7 + 2.0 + 1.6)/3 = 1.43

 

2. This new expression value will be associated with Id A from the dataset, since this identifier came first.

 

3. The average ratio of 1.43 does not meet the cutoff of 2.0 in the analysis settings, so the gene would be excluded from the analysis.

 

Example 3:

Id A: 0.05

Id B: 0.001

Id C: 0.01

 

Ids A-C map to the same gene.

Assume all IDs map in IPA. Expression values are p-values.

Assume a p-value cutoff of 0.02.

 

Resolve Duplicates Using Maximum

Id A, with a p-value of 0.05 would be used in the rest of the analysis.

 

The p-value 0.05 is greater than 0.02 (cutoff value), so this gene would be excluded from the analysis.

 

Resolve Duplicates Using Minimum (default setting for p-value)

Id B, with a p-value of 0.001 would be used in the rest of the analysis.

 

The p-value of 0.001 is less than the cutoff of 0.02 in the analysis settings, so this gene would be included in the analysis.

 

Resolve Duplicates Using Median

Id C, with a p-value of 0.01 would be used because this is the middle value. (0.001, 0.01, 0.05)

This newly calculated expression value would be associated with Id A from the dataset, since this identifier came first.

 

The p-value 0.01 is less than the cutoff of 0.02 in the analysis settings, so this gene would be included in the analysis.

 

Resolve Duplicates Using Average

(0.05 + 0.001 + 0.01)/3 = 0.0203

 

This newly calculated expression value would be associated with Id A from the dataset, since this identifier came first.

 

The p-value of 0.0203 is greater than the cutoff of 0.02 in the analysis settings, so this gene would be excluded from the analysis.
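The four outcomes of Example 3 can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout is illustrative, and it assumes a p-value passes when it is less than or equal to the cutoff, which the text does not state for the boundary case.

```python
from statistics import mean, median

p_values = [0.05, 0.001, 0.01]   # Ids A, B, C
cutoff = 0.02

# p-values are non-negative, so no absolute value or log transform is needed.
resolved = {
    "Maximum": max(p_values),
    "Minimum": min(p_values),
    "Median": median(p_values),
    "Average": mean(p_values),
}

# Assumption: a gene is included when its resolved p-value is at or below the cutoff.
included = {method: value <= cutoff for method, value in resolved.items()}

for method, value in resolved.items():
    status = "included" if included[method] else "excluded"
    print(f"{method}: {value:.4f} -> {status}")
```

Only the Minimum and Median settings keep the gene in this example; the Average (about 0.0203) falls just above the cutoff.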
