Suppose the data to be processed contain three code variables: household_ID, city_ID, and state_ID. They record the household, city, and state codes of each individual. If the three variables are treated as one combination, each distinct combination of values identifies a different individual.

This post shows how to generate a unique code (unique ID) for each individual in this situation.

1. Data generation

First, set up 25,000 observations.

    clear
    set obs 25000

Assume household_ID takes values in [1, 50], city_ID in [1, 20], and state_ID in [1, 50].

    * randomly generate the code variables
    gen household_ID = ceil(runiform()*50)
    gen city_ID = ceil(runiform()*20)
    gen state_ID = ceil(runiform()*50)

In addition, generate two variables, x1 and x2.

    * randomly generate x1 and x2
    gen x1 = rnormal()
    gen x2 = rnormal()

The data look like this:

    . list household_ID city_ID state_ID x1 x2 in 1/10

         +-------------------------------------------------------+
         | househ~D   city_ID   state_ID          x1          x2 |
         |-------------------------------------------------------|
      1. |       18        10         28    2.025588   -.1922264 |
      2. |       14        11          8    1.042631   -.0807038 |
      3. |        7        17         21    .2977124    .6150526 |
      4. |        2         4         28   -1.722132   -1.358765 |
      5. |       44         8         36   -.7291995    .0929139 |
         |-------------------------------------------------------|
      6. |       18        12         18    .8618261    .6687715 |
      7. |        4         4         29    -.239354    .6361541 |
      8. |       17        14          2     .516549   -.6399707 |
      9. |       28        15         19   -1.812016   -1.628398 |
     10. |       44         2         23   -1.015124   -.7855705 |
         +-------------------------------------------------------+

2. Two ways to create a unique ID

2.1 Creating a unique ID with egen

The simplest way to create a unique ID is the group() function of egen, which assigns one integer to each distinct combination of the listed variables.

    . egen ID = group(household_ID city_ID state_ID)

    . list household_ID state_ID city_ID ID in 1/10

         +---------------------------------------+
         | househ~D   state_ID   city_ID      ID |
         |---------------------------------------|
      1. |       18         28        10    6890 |
      2. |       14          8        11    5296 |
      3. |        7         21        17    2724 |
      4. |        2         28         4     469 |
      5. |       44         36         8   17091 |
         |---------------------------------------|
      6. |       18         18        12    6925 |
      7. |        4         29         4    1253 |
      8. |       17          2        14    6553 |
      9. |       28         19        15   10945 |
     10. |       44         23         2   16966 |
         +---------------------------------------+

Sorting the data by the three code variables shows how the IDs are assigned: they are numbered in sort order, and observations with the same combination (rows 4 and 5, or rows 7 and 8, below) share the same ID.

    . sort household_ID city_ID state_ID

    . list household_ID city_ID state_ID ID in 1/10

         +------------------------------------+
         | househ~D   city_ID   state_ID   ID |
         |------------------------------------|
      1. |        1         1          2    1 |
      2. |        1         1          5    2 |
      3. |        1         1          7    3 |
      4. |        1         1         10    4 |
      5. |        1         1         10    4 |
         |------------------------------------|
      6. |        1         1         11    5 |
      7. |        1         1         15    6 |
      8. |        1         1         15    6 |
      9. |        1         1         20    7 |
     10. |        1         1         21    8 |
         +------------------------------------+

2.2 Creating a string unique ID

Another way to create a unique ID is to build a string variable. This takes slightly more work, but the resulting ID is more direct and easier to read, because each code can be read straight off the ID.

The command is:

    . gen ID3 = "H" + string(household_ID) + "C" + string(city_ID) + "S" + string(state_ID)

    . list household_ID city_ID state_ID ID3 in 1/10

         +-------------------------------------------+
         | househ~D   city_ID   state_ID         ID3 |
         |-------------------------------------------|
      1. |       18        10         28   H18C10S28 |
      2. |       14        11          8    H14C11S8 |
      3. |        7        17         21    H7C17S21 |
      4. |        2         4         28     H2C4S28 |
      5. |       44         8         36    H44C8S36 |
         |-------------------------------------------|
      6. |       18        12         18   H18C12S18 |
      7. |        4         4         29     H4C4S29 |
      8. |       17        14          2    H17C14S2 |
      9. |       28        15         19   H28C15S19 |
     10. |       44         2         23    H44C2S23 |
         +-------------------------------------------+

3. Handling duplicate observations

Because the data contain duplicate observations, the unique ID created above is not actually unique. Summarizing ID shows that it runs only from 1 to 19757, although there are 25,000 observations, and duplicates report confirms that many IDs occur more than once.

    . sum ID

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
              ID |     25,000    9890.821    5709.989          1      19757

    . duplicates report ID

    Duplicates in terms of ID

    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |        15298             0
            2 |         7528          3764
            3 |         1836          1224
            4 |          312           234
            5 |           20            16
            6 |            6             5
    --------------------------------------

At this point, you need to decide how to handle the duplicates. One option is to collapse them into a single observation per ID:

    collapse (mean) mean_x1=x1 mean_x2=x2 (median) ///
             med_x1=x1 med_x2=x2, by(ID)

This command replaces the data in memory with one observation per ID and creates new variables that hold the mean and the median of x1 and x2 across the observations sharing that ID. The data now look like this:

    . list in 1/10

         +----------------------------------------------------+
         | ID     mean_x1     mean_x2      med_x1      med_x2 |
         |----------------------------------------------------|
      1. |  1    .1792767    .7344462    .1792767    .7344462 |
      2. |  2    1.093224   -.1823799    1.093224   -.1823799 |
      3. |  3   -.3094974    2.653506   -.3094974    2.653506 |
      4. |  4    .4993488    .8985389    .4993488    .8985389 |
      5. |  5   -.7429997   -.0464208   -.7429997   -.0464208 |
         |----------------------------------------------------|
      6. |  6    .4437077   -.5161228    .4437077   -.5161228 |
      7. |  7   -.7664803     1.15589   -.7664803     1.15589 |
      8. |  8      .59837    .1051743      .59837    .1051743 |
      9. |  9    .9568613   -.7659643    .9568613   -.7659643 |
     10. | 10   -.8682789   -1.468759   -.8682789   -1.468759 |
         +----------------------------------------------------+

After this step, the data no longer contain duplicate IDs.

    . duplicates report ID

    Duplicates in terms of ID

    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |        19757             0
    --------------------------------------
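The steps in this post can also be combined into one do-file that checks the IDs as it goes. The sketch below is an illustration under stated assumptions rather than part of the original workflow: the seed value, the zero-padded format "%02.0f" (which gives string IDs a fixed width, such as H07C17S21), and the variable names ID_str and dup are all hypothetical choices.

```stata
* Sketch only: the seed, the padded format, and the names ID_str / dup are assumptions
clear
set seed 12345                      // assumed seed, so the random draws are reproducible
set obs 25000

gen household_ID = ceil(runiform()*50)
gen city_ID      = ceil(runiform()*20)
gen state_ID     = ceil(runiform()*50)
gen x1 = rnormal()
gen x2 = rnormal()

* Numeric ID: one integer per distinct combination of the three codes
egen ID = group(household_ID city_ID state_ID)

* String ID with zero padding, e.g. H07C17S21 instead of H7C17S21
gen ID_str = "H" + string(household_ID, "%02.0f") ///
           + "C" + string(city_ID, "%02.0f")      ///
           + "S" + string(state_ID, "%02.0f")

* Within each numeric ID, the string ID should be constant
bysort ID (ID_str): assert ID_str[1] == ID_str[_N]

* Flag observations whose ID occurs more than once
duplicates tag ID, generate(dup)    // dup = number of other observations sharing the ID
count if dup > 0

* isid exits with an error when ID is not unique; capture lets the do-file continue
capture isid ID
display "isid return code before collapse: " _rc

* Collapse to one row per ID while keeping the original code variables
collapse (mean) mean_x1=x1 mean_x2=x2 (median) med_x1=x1 med_x2=x2, ///
    by(ID ID_str household_ID city_ID state_ID)
isid ID                             // passes silently once ID is unique
```

Listing the code variables in by() keeps them in the collapsed data. Since ID is defined by those three codes, adding them does not change the grouping.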