news 2025/11/16 23:12:03

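Table of contents

Preface
1. Introduction
1.1 Boosting and Bagging in ensemble learning
2. Bagging and random forests
2.1 Random forest construction process
2.2 Out-of-bag estimation
2.3 Random forest API
2.4 Random forest prediction example
3. Otto case study
3.1 Workflow analysis
3.2 Basic data processing
3.3 Basic model training
3.4 Model tuning
3.5 Generating the submission data
4. Boosting
5. GBDT introduction
Summary

13.03

Preface

1. Introduction

Ensemble learning solves a single prediction problem by building several models. It works by generating multiple classifiers/models that each learn and make predictions independently. These predictions are then combined into one prediction, which is generally better than the prediction of any single classifier.

1.1 Boosting and Bagging in ensemble learning

As long as the individual classifiers do not perform too badly, the ensemble result is usually better than that of a single classifier.

2. Bagging and random forests

2.1 Random forest construction process

2.2 Out-of-bag estimation

2.3 Random forest API

    sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)

The signature above is the one given in the source material. In scikit-learn 0.22 and later the default for n_estimators is 100.

- n_estimators: the number of decision trees in the forest.
- criterion: string, optional (default "gini"). The measure used to evaluate a split.
- max_depth: integer or None, optional (default None). The maximum depth of a tree. Typical values to try: 5, 8, 15, 25, 30.
- max_features: default "auto" in the version described here. The maximum number of features considered when looking for the best split.
  - If "auto", then max_features = sqrt(n_features).
  - If "sqrt", then max_features = sqrt(n_features) (same as "auto").
  - If "log2", then max_features = log2(n_features).
  - If None, then max_features = n_features.
  The "auto" option was deprecated in scikit-learn 1.1 and removed in 1.3; use "sqrt" instead.
- bootstrap: boolean, optional (default True). Whether to sample with replacement when building trees.
- min_samples_split: the minimum number of samples required to split an internal node. It limits when a subtree may keep splitting: if a node has fewer samples than min_samples_split, no further attempt is made to find the best feature to split on. The default is 2. If the dataset is small, this value can be left alone; if the dataset is very large, increasing it is recommended.
- min_samples_leaf: the minimum number of samples in a leaf node. A split is only kept if it leaves at least this many samples in each child; otherwise the would-be leaf is pruned together with its sibling. The default is 1. A leaf is a terminal node of the decision tree, and smaller leaves make it easier for the model to fit noise in the training data.
- min_impurity_split: the minimum impurity for splitting a node. It limits the growth of the tree: if a node's impurity (based on the Gini index or mean squared error) is below this threshold, the node produces no children and becomes a leaf. This parameter was removed in scikit-learn 1.0 in favor of min_impurity_decrease.

The most important of the parameters above are the maximum number of features (max_features), the maximum depth (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples in a leaf (min_samples_leaf).

2.4 Random forest prediction example

This example reuses the Titanic dataset from earlier; x_train, x_test, y_train and y_test come from that earlier split.

    # Instantiate a random forest
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier()

    # Tune hyperparameters with a grid search
    from sklearn.model_selection import GridSearchCV
    param = {"n_estimators": [100, 120, 300], "max_depth": [3, 7, 10]}
    gc = GridSearchCV(rf, param_grid=param, cv=3)
    gc.fit(x_train, y_train)

- n_estimators controls the number of decision trees in the forest; here 100, 120 and 300 are tried. More trees usually give better performance but cost more computation.
- max_depth controls the maximum depth of each tree; here 3, 7 and 10 are tried. Shallower trees give a simpler model that is less prone to overfitting; deeper trees give a more complex model that may overfit.
- param_grid=param is the grid of hyperparameter combinations to search.
- cv=3 means 3-fold cross-validation.

In other words, the grid search evaluates all 3 x 3 = 9 combinations with cross-validation and keeps the best random forest in gc.

    print("Random forest test score:\n", gc.score(x_test, y_test))

Bagging + decision tree / linear regression / logistic regression / deep learning... = a bagging ensemble method.

Ensemble methods built this way:

- can typically improve generalization accuracy by roughly 2% over the base algorithm (a rule of thumb from the source, not a guarantee);
- are simple, convenient and general.

3. Otto case study

3.1 Workflow analysis

- Get the data
- Basic data processing
  - The dataset is fairly large, so try whether it can be cut down
  - Convert the representation of the target values
- Model training
- Basic model training

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # `data` is the Otto training DataFrame (the loading step is not shown in the source)

    # Plot the class distribution
    import seaborn as sns
    sns.countplot(x=data.target)
    plt.show()

The plot shows that the classes are imbalanced (the vertical axis is the sample count per class).

3.2 Basic data processing

    # Basic data processing: the data is already anonymized, so no special handling is needed
    # Truncate the data
    new1_data = data[:10000]
    new1_data

Simply taking the first 10000 rows like this is not acceptable: the rows are not shuffled, so the slice does not represent all classes fairly.

    # Basic data processing: the data is already anonymized, so no special handling is needed
    # Cut the data down with random undersampling instead
    y = data["target"]
    x = data.drop(["id", "target"], axis=1)

y is now the target column, and x is the feature table with the id and target columns removed.

    # Undersample
    from imblearn.under_sampling import RandomUnderSampler
    rus = RandomUnderSampler(random_state=42)
    x_res, y_res = rus.fit_resample(x, y)

    # Plot the class distribution again
    import seaborn as sns
    sns.countplot(x=y_res)
    plt.show()

Undersampling balances the classes by reducing the number of samples in the majority classes.

    # Convert label values such as Class_1 into numbers
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    y_res = le.fit_transform(y_res)
    y_res

- Fit: analyze the input labels (y_res) and extract all unique classes, such as Class_1, Class_2, Class_3. Each unique class is assigned an integer, numbered from 0 in sorted order, which builds a "class -> number" mapping. For example: Class_1 -> 0, Class_2 -> 1, Class_3 -> 2.
- Transform: use that mapping to convert the original labels (Class_1, Class_2, ...) into their numbers. The output is an all-numeric label array (y_res) that is easier for models to handle, since most models only accept numeric input.

    # Split into training and test sets
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x_res, y_res, test_size=0.2, random_state=42)

3.3 Basic model training

    # Basic model training
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(oob_score=True)
    rf.fit(x_train, y_train)

oob_score=True is a very practical parameter. OOB stands for Out-of-Bag. When a random forest builds each decision tree, it draws that tree's training data from the training set by bootstrap sampling (random sampling with replacement). The samples that were not drawn are that tree's out-of-bag (OOB) samples. Because the sampling is random, every tree has its own OOB samples, and for each tree about 37% of the training samples end up out of bag (a mathematical approximation). With oob_score=True, after training the model uses the OOB samples as a validation set: each sample is predicted only by the trees that did not see it, and the combined predictions give an overall performance score (accuracy by default for classification). This is the OOB score, available as rf.oob_score_.

    rf.oob_score_

    y_pre = rf.predict(x_test)
    print(y_pre)

    rf.score(x_test, y_test)

    # Plot the distribution of the predicted classes
    import seaborn as sns
    sns.countplot(x=y_pre)
    plt.show()

However, this task has to be evaluated with log loss.

    # Evaluate the model with log loss
    from sklearn.metrics import log_loss
    log_loss(y_test, y_pre, normalize=True)

- The source also passes epsilon=1e-15, a tiny value used for numerical stability. Log loss contains log(p), which tends to infinity as p approaches 0, so predicted probabilities are clipped to [epsilon, 1-epsilon] to avoid numerical errors. In scikit-learn the argument is named eps; it was deprecated in 1.3 and removed in 1.5, so the call above leaves it out.
- normalize=True controls whether the loss is normalized. If True (the default), the mean log loss is returned (total loss divided by the number of samples). If False, the total log loss (the sum over all samples) is returned.

Note that the second argument must be an array with one column per class (one-hot style here, class probabilities in general), not a vector of class labels, so the call above does not work.

    from sklearn.preprocessing import OneHotEncoder
    onehot = OneHotEncoder(sparse_output=False)
    y_test1 = onehot.fit_transform(y_test.reshape(-1, 1))
    y_pre1 = onehot.transform(y_pre.reshape(-1, 1))

sparse_output=False makes the encoder return a dense array instead of a sparse matrix, so the result is an ordinary NumPy array that is easier to inspect and process.

    # Evaluate the model with log loss
    from sklearn.metrics import log_loss
    log_loss(y_test1, y_pre1, normalize=True)

This produces a value. Can the log loss be made smaller?

    # Change the prediction output to per-class probabilities to reduce the log loss
    y_pre_proba = rf.predict_proba(x_test)
    y_pre_proba

Instead of predicting one class directly, this returns the predicted probability of each class for every sample.

    log_loss(y_test1, y_pre_proba, normalize=True)

The log loss immediately drops to a value below 1.

3.4 Model tuning

    # Model tuning: hyperparameters
    ## Find the best n_estimators
    tuned_parameters = range(10, 200, 10)
    # Array for the accuracy values
    accuracy_t = np.zeros(len(tuned_parameters))
    # Array for the error (log loss) values
    error_t = np.zeros(len(tuned_parameters))
    # Tuning loop
    for i, n_estimators in enumerate(tuned_parameters):
        rf2 = RandomForestClassifier(n_estimators=n_estimators, max_depth=10, max_features=10,
                                     min_samples_leaf=10, random_state=0, n_jobs=-1, oob_score=True)
        rf2.fit(x_train, y_train)
        accuracy_t[i] = rf2.oob_score_
        y_pre = rf2.predict_proba(x_test)
        error_t[i] = log_loss(y_test1, y_pre, normalize=True)
        print(error_t)

Looking at the last printed output, the log loss keeps getting smaller.

    # Visualize the tuning results
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), dpi=100)
    axes[0].plot(tuned_parameters, error_t)
    axes[0].set_xlabel("n_estimators")
    axes[0].set_ylabel("error_t")
    axes[0].grid(True)
    axes[1].plot(tuned_parameters, accuracy_t)
    axes[1].set_xlabel("n_estimators")
    axes[1].set_ylabel("accuracy_t")
    axes[1].grid(True)
    plt.show()

Reading off the plots, n_estimators=175 is a suitable choice: the log loss is low and accuracy_t is high.

    # Model tuning: hyperparameters
    ## Find the best max_features
    tuned_parameters = range(5, 40, 5)
    # Array for the accuracy values
    accuracy_t = np.zeros(len(tuned_parameters))
    # Array for the error (log loss) values
    error_t = np.zeros(len(tuned_parameters))
    # Tuning loop
    for i, max_features in enumerate(tuned_parameters):
        rf2 = RandomForestClassifier(n_estimators=175, max_depth=10, max_features=max_features,
                                     min_samples_leaf=10, random_state=0, n_jobs=-1, oob_score=True)
        rf2.fit(x_train, y_train)
        accuracy_t[i] = rf2.oob_score_
        y_pre = rf2.predict_proba(x_test)
        error_t[i] = log_loss(y_test1, y_pre, normalize=True)
        print(error_t)

    # Visualize the tuning results
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), dpi=100)
    axes[0].plot(tuned_parameters, error_t)
    axes[0].set_xlabel("max_features")
    axes[0].set_ylabel("error_t")
    axes[0].grid(True)
    axes[1].plot(tuned_parameters, accuracy_t)
    axes[1].set_xlabel("max_features")
    axes[1].set_ylabel("accuracy_t")
    axes[1].grid(True)
    plt.show()

max_features=15 already works well.

    # Model tuning: hyperparameters
    ## Find the best max_depth
    tuned_parameters = range(10, 100, 10)
    # Array for the accuracy values
    accuracy_t = np.zeros(len(tuned_parameters))
    # Array for the error (log loss) values
    error_t = np.zeros(len(tuned_parameters))
    # Tuning loop
    for i, max_depth in enumerate(tuned_parameters):
        rf2 = RandomForestClassifier(n_estimators=175, max_depth=max_depth, max_features=15,
                                     min_samples_leaf=10, random_state=0, n_jobs=-1, oob_score=True)
        rf2.fit(x_train, y_train)
        accuracy_t[i] = rf2.oob_score_
        y_pre = rf2.predict_proba(x_test)
        error_t[i] = log_loss(y_test1, y_pre, normalize=True)
        print(error_t)

    # Visualize the tuning results
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), dpi=100)
    axes[0].plot(tuned_parameters, error_t)
    axes[0].set_xlabel("max_depth")
    axes[0].set_ylabel("error_t")
    axes[0].grid(True)
    axes[1].plot(tuned_parameters, accuracy_t)
    axes[1].set_xlabel("max_depth")
    axes[1].set_ylabel("accuracy_t")
    axes[1].grid(True)
    plt.show()

max_depth=30 turns out to be just right.

min_samples_leaf is an important hyperparameter that controls the minimum number of samples in a leaf node. With min_samples_leaf=1, a leaf may contain a single sample, which is the most flexible setting but may overfit. With min_samples_leaf=10, every leaf must contain at least 10 samples, which is more conservative and less prone to overfitting.

    # Model tuning: hyperparameters
    ## Find the best min_samples_leaf
    tuned_parameters = range(1, 10, 2)
    # Array for the accuracy values
    accuracy_t = np.zeros(len(tuned_parameters))
    # Array for the error (log loss) values
    error_t = np.zeros(len(tuned_parameters))
    # Tuning loop
    for i, min_samples_leaf in enumerate(tuned_parameters):
        rf2 = RandomForestClassifier(n_estimators=175, max_depth=30, max_features=15,
                                     min_samples_leaf=min_samples_leaf, random_state=0, n_jobs=-1, oob_score=True)
        rf2.fit(x_train, y_train)
        accuracy_t[i] = rf2.oob_score_
        y_pre = rf2.predict_proba(x_test)
        error_t[i] = log_loss(y_test1, y_pre, normalize=True)
        print(error_t)

    # Visualize the tuning results
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), dpi=100)
    axes[0].plot(tuned_parameters, error_t)
    axes[0].set_xlabel("min_samples_leaf")
    axes[0].set_ylabel("error_t")
    axes[0].grid(True)
    axes[1].plot(tuned_parameters, accuracy_t)
    axes[1].set_xlabel("min_samples_leaf")
    axes[1].set_ylabel("accuracy_t")
    axes[1].grid(True)
    plt.show()

So min_samples_leaf=1 is chosen.

    # Train the final model with the tuned hyperparameters
    rf3 = RandomForestClassifier(n_estimators=175, max_depth=30, max_features=15,
                                 min_samples_leaf=1, random_state=0, n_jobs=-1, oob_score=True)
    rf3.fit(x_train, y_train)

    y_pre_proba1 = rf3.predict_proba(x_test)
    log_loss(y_test1, y_pre_proba1)

3.5 Generating the submission data

    # Generate the submission data
    test_data = pd.read_csv("./source/test_oob.csv")

    test_data_drop_id = test_data.drop(["id"], axis=1)
    test_data_drop_id

    y_pre_test = rf3.predict_proba(test_data_drop_id)
    y_pre_test

    result_data = pd.DataFrame(y_pre_test, columns=["Class_" + str(i) for i in range(1, 10)])
    result_data

    result_data.insert(loc=0, column="id", value=test_data["id"])
    result_data

The result can then be saved as a CSV file.

    result_data.to_csv("./source/otto_result.csv", index=False)

With index=False, the exported CSV file does not include the DataFrame's index column and only the data columns are saved. Note that the id column here is a regular data column, not the index.

4. Boosting

Learning accumulates step by step, from weak to strong. In short, each time a new weak learner is added, the overall ability improves. Representative algorithms: AdaBoost, GBDT, XGBoost, LightGBM.

5. GBDT introduction

GBDT stands for Gradient Boosting Decision Tree. Among traditional machine learning algorithms, the source ranks GBDT in the top three. To understand what GBDT really means, you need to understand what its two parts, Gradient Boosting and Decision Tree, each are.

Summary

13.03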
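As a supplement to the log-loss evaluation in the Otto model-training step, the following minimal sketch shows why hard 0/1 predictions score so much worse than class probabilities. The helper name `multiclass_log_loss` and the three-sample toy data are assumptions made for illustration; the function mirrors the clipping idea described in the article but is not scikit-learn's implementation.

```python
import math


def multiclass_log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Mean negative log-likelihood over samples (hypothetical helper).

    y_true_onehot: list of one-hot rows, one row per sample.
    y_prob: list of rows with one predicted probability per class.
    """
    total = 0.0
    for truth_row, prob_row in zip(y_true_onehot, y_prob):
        for t, p in zip(truth_row, prob_row):
            # Clip to [eps, 1 - eps] so that log(0) never occurs.
            p = min(max(p, eps), 1 - eps)
            total -= t * math.log(p)
    return total / len(y_true_onehot)


# Toy data: three samples, three classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Hard predictions (one-hot encoded class labels); the third one is wrong.
hard = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]

# Probabilities; the third sample is also misclassified, but not with certainty.
soft = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.5, 0.4]]

hard_loss = multiclass_log_loss(y_true, hard)
soft_loss = multiclass_log_loss(y_true, soft)
print(f"hard predictions: {hard_loss:.4f}")
print(f"probabilities:    {soft_loss:.4f}")
```

A single confident mistake contributes -log(1e-15), about 34.5, so the hard predictions average roughly 11.5 here, while the probabilistic predictions average about 0.50 even though they make the same mistake. This is the effect the article observes when switching from predict to predict_proba.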

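The "about 37%" out-of-bag share quoted for random forests follows from bootstrap sampling: when n samples are drawn with replacement from n, a given sample is never drawn with probability (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows. The sketch below checks this by simulation; the function name `oob_fraction`, the sample size, and the number of repetitions are arbitrary choices for illustration.

```python
import math
import random


def oob_fraction(n_samples, rng):
    """Draw one bootstrap sample of size n_samples (with replacement) and
    return the share of indices that were never drawn, i.e. the
    out-of-bag share for a single tree."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 10_000               # size of the hypothetical training set
repeats = 20             # number of simulated trees

simulated = sum(oob_fraction(n, rng) for _ in range(repeats)) / repeats
exact = (1 - 1 / n) ** n  # probability that one sample is never drawn
limit = 1 / math.e        # limit of the expression above as n grows

print(f"simulated OOB share: {simulated:.4f}")
print(f"(1 - 1/n)^n:         {exact:.4f}")
print(f"1/e:                 {limit:.4f}")
```

All three values should land close to 0.368, which is why each tree leaves a bit more than a third of the training set available for out-of-bag validation.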