
Boston Housing Price Prediction Exercise

Final Result

My final attempt reached this score on Kaggle, where the evaluation metric is RMSE:
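For reference, the metric itself is simple to compute by hand. A minimal sketch in plain Python (this is the textbook formula, not the Kaggle scorer itself):

```python
import math

def rmse(y_true, y_pred):
    # root mean squared error: square root of the average squared residual
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse([3.0, 5.0], [1.0, 5.0]))  # sqrt((4 + 0) / 2) = sqrt(2)
```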



Reading the Data

I first read the data with numpy, following a Bilibili tutorial. Later GPT suggested pandas instead, so I learned pandas and rewrote it as:

class BostonHousingDataset(Dataset):
    def __init__(self, filepath, is_test=False):
        self.is_test = is_test
        if is_test:
            df = pd.read_csv(filepath)
            self.IDs = df.iloc[:, 0]
            self.x_data = torch.tensor(df.iloc[:, 1:].values, dtype=torch.float32)
            self.len = self.x_data.shape[0]
        else:
            xy = pd.read_csv(filepath)
            self.len = xy.shape[0]
            self.x_data = torch.tensor(xy.iloc[:, 1:-1].values, dtype=torch.float32)
            self.y_data = torch.tensor(xy.iloc[:, [-1]].values, dtype=torch.float32)

    def __getitem__(self, index):
        if not self.is_test:
            return self.x_data[index], self.y_data[index]
        else:
            return self.IDs[index], self.x_data[index]

    def __len__(self):
        return self.len

I find that knowing the full English phrase behind a function or method makes it much easier to understand and remember, so I looked up what these stand for:

  • df: data frame
  • df.iloc: integer location
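A tiny sketch of what those integer-location slices actually select, on a toy frame with the same column layout (ID, features, target) as the competition CSV; the column names here are made up:

```python
import io
import pandas as pd

# toy stand-in for the Kaggle file: ID column, feature columns, then the target medv
csv_text = "ID,crim,rm,medv\n1,0.006,6.5,24.0\n2,0.027,6.4,21.6\n"
df = pd.read_csv(io.StringIO(csv_text))

features = df.iloc[:, 1:-1]   # integer-location slice: every column between ID and medv
target = df.iloc[:, [-1]]     # list index keeps the result 2-D, matching (n, 1) labels

print(list(features.columns))  # ['crim', 'rm']
print(target.shape)            # (2, 1)
```

The `[-1]` list index is the reason `self.y_data` comes out with shape `(n, 1)` instead of `(n,)`.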

__getitem__ implements indexing and __len__ reports how many samples there are. These two magic methods are at work in

train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=0)


test_loader = DataLoader(dataset=test_dataset,
                         batch_size=1,
                         shuffle=False)


for i, data in enumerate(train_loader, 0):
    inputs, labels = data


for ID, inputs in test_loader:
    ...

all four places above.
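Plain Python shows why: indexing and len() dispatch straight to these two methods, which is exactly what DataLoader relies on. A minimal sketch with a toy class (no torch needed):

```python
class ToyDataset:
    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, index):   # called by dataset[i] -- DataLoader uses it per sample
        return self.rows[index]

    def __len__(self):              # called by len(dataset) -- lets DataLoader count batches
        return len(self.rows)

ds = ToyDataset([(1.0, 24.0), (2.0, 21.6)])
print(len(ds))   # 2
print(ds[0])     # (1.0, 24.0)
```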


Building the Model

Now it was time to build the model. I ran into a genuinely fun problem, and working through it was fascinating.

I adopted:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(13, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(16, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        return self.fc4(x)

That worked reasonably well. But my earlier model was not like this; its forward pass was:

x = self.relu(self.fc1(x))
x = self.relu(self.fc2(x))
x = self.sigmoid(self.fc3(x))
return self.fc4(x)

As you can see, I had put the sigmoid on the second-to-last layer. After submitting, the RMSE simply exploded:


That made me seriously question my life choices. I asked ChatGPT, and it scolded me:

🚨 The sigmoid may be limiting the output range

x = self.sigmoid(self.fc3(x))
  • Sigmoid outputs values in (0, 1), but the true range of the Boston housing target medv is [5, 50] (in thousands of dollars)
  • So even if the model predicts perfectly, it can only emit values between 0 and 1, completely outside the actual price range

🔧 Fix: drop the sigmoid, or use a more suitable nonlinearity (e.g. ReLU):

x = self.relu(self.fc3(x))

Although ChatGPT failed to notice that what I return is not the sigmoid but a 16-to-1 fully connected layer, I think its point stands: my sigmoid sat far too close to the output, leaving the last linear layer with the job of stretching the 0-1 range onto 5-50. Intuitively that is bound to be awkward. So I moved the sigmoid to the first layer, activated the other layers with ReLU, and the final RMSE turned out fine~
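The squashing is easy to verify numerically; a small sketch in plain Python (no torch needed, same formula as nn.Sigmoid):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# even extreme pre-activations stay strictly inside (0, 1), so a Linear(16, 1)
# placed right after it must stretch that thin sliver onto roughly 5-50
for z in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(z, sigmoid(z))
```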


Generating the Submission File

The last step is generating the submission data. I'm still quite unfamiliar with this part:

test_dataset = BostonHousingDataset('test.csv', True)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=1,
                         shuffle=False)

model.eval()
predictions = []
IDs = []

with torch.no_grad():
    for ID, inputs in test_loader:
        inputs = inputs.to(device)
        output = model(inputs)
        predictions.append(output.item())
        IDs.append(ID.item())

submission = pd.DataFrame({'ID': IDs, 'medv': predictions})
submission.to_csv('submission.csv', index=False)

I need to go over this part a few more times.
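What those final two pandas calls produce can be checked on toy data: with no path argument, to_csv returns the CSV text instead of writing a file (the IDs and predictions here are made up):

```python
import pandas as pd

IDs = [1, 2, 3]
predictions = [24.0, 21.6, 34.7]

submission = pd.DataFrame({'ID': IDs, 'medv': predictions})
csv_text = submission.to_csv(index=False)   # index=False drops pandas' row index column

print(csv_text.splitlines()[0])  # ID,medv  -- the header row the scorer expects
print(csv_text.splitlines()[1])  # 1,24.0
```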


Debug

Another amusing bug: I forgot to append .to(device) to the inputs. After fixing it, I browsed through my senior Cupa's submission history and found that Cupa had also once forgotten model.to(device). What a fun coincidence.
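The pattern that avoids both slips is to move the model once and then move every batch tensor each step; a small sketch with a random, untrained layer:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(13, 1).to(device)   # forget this: weights stay on the CPU
inputs = torch.randn(4, 13).to(device)      # forget this: on a GPU box, forward()
                                            # raises a device-mismatch RuntimeError
out = model(inputs)
print(out.shape)  # torch.Size([4, 1])
```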


Source Code

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import pandas as pd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BostonHousingDataset(Dataset):
    def __init__(self, filepath, is_test=False):
        self.is_test = is_test
        if is_test:
            df = pd.read_csv(filepath)
            self.IDs = df.iloc[:, 0]
            self.x_data = torch.tensor(df.iloc[:, 1:].values, dtype=torch.float32)
            self.len = self.x_data.shape[0]
        else:
            xy = pd.read_csv(filepath)
            self.len = xy.shape[0]
            self.x_data = torch.tensor(xy.iloc[:, 1:-1].values, dtype=torch.float32)
            self.y_data = torch.tensor(xy.iloc[:, [-1]].values, dtype=torch.float32)

    def __getitem__(self, index):
        if not self.is_test:
            return self.x_data[index], self.y_data[index]
        else:
            return self.IDs[index], self.x_data[index]

    def __len__(self):
        return self.len


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(13, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(16, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        return self.fc4(x)

dataset = BostonHousingDataset('train.csv')

train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=0)


model = Model().to(device)

criterion = nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train(epoch):
    model.train()
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels.view(-1, 1))  # .view adjusts the label shape

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss}")

if __name__ == '__main__':
    epochs = 100
    for epoch in range(epochs):
        train(epoch)

    test_dataset = BostonHousingDataset('test.csv', True)
    test_loader = DataLoader(dataset=test_dataset,
                             batch_size=1,
                             shuffle=False)

    model.eval()
    predictions = []
    IDs = []

    with torch.no_grad():
        for ID, inputs in test_loader:
            inputs = inputs.to(device)
            output = model(inputs)
            predictions.append(output.item())
            IDs.append(ID.item())

    submission = pd.DataFrame({'ID': IDs, 'medv': predictions})
    submission.to_csv('submission.csv', index=False)