Earth Resources Data Cloud: Data Resource Details

Human Conversation Training Data

Published: 2026-03-17 14:30:51 | Resource ID: 2032006903956410370 | Resource type: Free

Summary

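The dataset "Human Conversation training data" is intended mainly for supervised learning tasks. The data is primarily text, and the application scenario leans toward security detection. Description: Training data aggregated from various sources for training a chatbot with NLP.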

Task type: supervised learning on text.

Suggested workflow: clean and tokenize the text first, then compare a TF-IDF + linear model against a pretrained language model.
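As a rough illustration of the cleaning, tokenization, and TF-IDF steps, here is a minimal standard-library sketch. The function names `tokenize` and `tf_idf` are hypothetical, the tokenization rule (lowercase, keep letters, digits, and apostrophes) is an assumption, and the weighting is the plain unsmoothed `tf * log(N / df)` form, which may differ from what a given library computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(documents):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            tok: (count / total) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


# Illustrative utterances, not taken from the dataset.
docs = ["Hello there!", "Hello, how are you?"]
doc_weights = tf_idf(docs)
```

A token that occurs in every document, such as "hello" here, gets weight zero, while rarer tokens get positive weights. The resulting dictionaries could then be fed to a linear model of your choice.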

Evaluation advice: use stratified splits or cross-validation, and focus on classification metrics such as F1, recall, and AUC.

Available files: no standard CSV was detected; check the index or README files in the directory first.

Context

I was working with RNN models in TensorFlow and was reading about conversation bots when the idea struck me to create a bot myself. I looked for chat data but could not find anything useful. Then I came across the Meena chatbot and Mitsuku chatbot data, and compiled them together with some data from a human chats corpus.

Content

The corpus contains chat data labelled with Human 1 and Human 2 in an ask-response manner. Each odd row, labelled Human 1, initiates the exchange, and each even row, labelled Human 2, is the response. The text after "Human x:" is the chat content; the label part can be removed during preprocessing.
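The label-stripping step described above could be sketched as follows. This is a minimal example that assumes each row looks like `Human 1: <text>` or `Human 2: <text>`. The helper names `parse_line` and `to_pairs` are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset.

```python
import re

# Assumed row format: "Human 1: text" or "Human 2: text".
LABEL_RE = re.compile(r"^\s*Human\s*(\d+)\s*:\s*(.*)$")


def parse_line(line):
    """Return (speaker_number, text), or None if the row has no label."""
    match = LABEL_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def to_pairs(lines):
    """Pair each Human 1 prompt with the Human 2 response that follows it."""
    pairs = []
    pending_prompt = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or unlabelled rows
        speaker, text = parsed
        if speaker == 1:
            pending_prompt = text
        elif speaker == 2 and pending_prompt is not None:
            pairs.append((pending_prompt, text))
            pending_prompt = None
    return pairs


# Invented sample rows in the assumed format.
sample = [
    "Human 1: Hi!",
    "Human 2: Hello, how are you?",
    "Human 1: Doing well, thanks.",
    "Human 2: Glad to hear it.",
]
pairs = to_pairs(sample)
```

Pairing by label rather than by row parity keeps the sketch tolerant of blank or malformed rows, at the cost of dropping any response that has no preceding prompt.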

Acknowledgements