Earth Resource Data Cloud: Data Resource Details

Dirty Data to Clean: What's Wrong with This Dataset

Published: 2026-03-17 14:31:44; Resource ID: 2031264387585970177; Resource type: Free


Summary

The dataset "Dirty data to clean What's wrong with this dataset" is listed as mainly intended for regression/prediction tasks, and its data is primarily tabular. Description: Animal data for data cleaning, visualization and geospatial analysis.

Task type: tabular regression/prediction.

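Suggested workflow: first handle missing values and outliers and encode the features, then compare logistic regression, random forest, and XGBoost.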
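The preprocessing half of this workflow can be sketched with the standard library alone. This is a minimal illustration, not the dataset author's method: the records, the field names `animal` and `height_cm`, and the choice of median imputation and the 1.5 × IQR rule are all assumptions made for the example. The model comparison step is not covered here.

```python
from statistics import median, quantiles

# Hypothetical records; field names and values are illustrative assumptions,
# not rows from animal_data_dirty1.csv.
records = [
    {"animal": "lynx", "height_cm": 60.0},
    {"animal": "lynx", "height_cm": 65.0},
    {"animal": "hedgehog", "height_cm": None},   # missing value
    {"animal": "hedgehog", "height_cm": 70.0},
    {"animal": "lynx", "height_cm": 6500.0},     # implausible value
]

def impute_median(rows, key):
    """Replace None in `key` with the median of the observed values."""
    fill = median(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def flag_outliers_iqr(rows, key, k=1.5):
    """Add an 'outlier' flag for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([r[key] for r in rows], n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, "outlier": not (low <= r[key] <= high)} for r in rows]

def one_hot(rows, key):
    """Replace the categorical field `key` with one 0/1 field per level."""
    levels = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != key}
        for level in levels:
            new_row[f"{key}_{level}"] = int(r[key] == level)
        encoded.append(new_row)
    return encoded

cleaned = impute_median(records, "height_cm")
flagged = flag_outliers_iqr(cleaned, "height_cm")
encoded = one_hot(flagged, "animal")
```

The median is used for imputation because it is barely affected by an extreme value such as the 6500 above; a mean would be pulled far off. Flagged rows are only marked, not dropped, so the decision to correct or remove them stays with the analyst.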

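Evaluation suggestion: use a stratified split or cross-validation, and prioritize classification metrics such as F1, recall, and AUC; these apply when the prediction target is categorical.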
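As a rough sketch of what a stratified split and two of these metrics involve, the following standard-library example splits indices per class and computes recall and F1 for one class. The labels, the 25% test fraction, and the function names are hypothetical; in practice a library such as scikit-learn would normally be used for this.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Take the same fraction from every class, at least one item each.
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def recall_and_f1(y_true, y_pred, positive):
    """Recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1

# Hypothetical, imbalanced labels: 8 of one class, 4 of another.
labels = ["lynx"] * 8 + ["hedgehog"] * 4
train_idx, test_idx = stratified_split(labels)

# Hypothetical predictions for four held-out rows.
y_true = ["lynx", "lynx", "hedgehog", "lynx"]
y_pred = ["lynx", "hedgehog", "hedgehog", "lynx"]
recall, f1 = recall_and_f1(y_true, y_pred, positive="lynx")
```

Splitting per class keeps the minority class represented in the test set, which a plain random split does not guarantee on small or imbalanced data.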

Available files: animal_data_dirty1.csv, animal_data_reworked.csv.

This dataset contains roughly 1,000 rows of data about animals spotted in Central/Eastern in 2024 (animal type, country, geolocation as latitude/longitude, gender, estimated height, and body length).

The data was artificially generated.

The primary purpose of this dataset is data cleaning; it can also be used for data visualization and geospatial analysis (e.g. with folium). This dataset has multiple issues, including:

duplicates,

missing data,
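The two issues named above, duplicates and missing data, can be counted before anything is fixed. The sketch below does this with the standard library on a small inline sample; the column names and rows are hypothetical and are not taken from the real CSV header.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; column names and values are assumptions for illustration.
SAMPLE = """animal,country,latitude,longitude,gender,height_cm
lynx,Poland,52.1,21.0,male,65
lynx,Poland,52.1,21.0,male,65
hedgehog,,47.5,19.0,female,
red squirrel,Hungary,,19.1,,22
"""

def audit(csv_text):
    """Count exact duplicate rows and empty cells per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # A row counts as a duplicate when the same tuple of values occurs again.
    seen = Counter(tuple(r.values()) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())
    missing = Counter()
    for r in rows:
        for col, val in r.items():
            if val is None or val.strip() == "":
                missing[col] += 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": dict(missing)}

report = audit(SAMPLE)
print(report)
```

To audit the actual file, the same function could be given the text of animal_data_dirty1.csv. This check only finds exact duplicates; rows that differ in spelling or letter case would need normalization first.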