Tawosi Dataset

This directory consists 46 files, two files per each of the 26-3*=23 projects:

23 files with "_deep-se" suffix are prepared to be used by Deep-SE.
23 files with "_tfidf-se" suffix are prepared to be used by TF/IDF-SE.

* _{One of the repositories including three projects has been removed from the public domain during the time that the manuscript for this study [1] was under revision. Therefore, although the paper reports the results for all 26 projects, the replication package includes 23 projects as we refrain from publishing the data for the three remaining projects in accordance with The General Data Protection Regulation.}

These 23 files are collected from 12 open source repositories by Tawosi et al. up until August, 2020. The files named after their project key as "[project key]_[approach].csv" e.g. MESOS_deep-se.csv, which is the set of issues collected from Appache repository Mesos project, and contains the features that Deep-SE needs for prediction. The following table shows the list of projects and the repositories where the project was collected from.

Project list

Repository	Project	Key	File for Deep-SE	File for TF/IDF-SE
Apache	Mesos	MESOS	MESOS_deeep-se.csv	MESOS_tfidf-se.csv
Apache	Alloy	ALOY	ALOY_deeep-se.csv	ALOY_tfidf-se.csv
Appcelerator	Appcelerator studio	TISTUD	TISTUD_deeep-se.csv	TISTUD_tfidf-se.csv
Appcelerator	Aptana studio	APSTUD	APSTUD_deeep-se.csv	APSTUD_tfidf-se.csv
Appcelerator	Command-Line Interface	CLI	CLI_deeep-se.csv	CLI_tfidf-se.csv
Appcelerator	Daemon	DAEMON	DAEMON_deeep-se.csv	DAEMON_tfidf-se.csv
Appcelerator	Documentation	TIDOC	TIDOC_deeep-se.csv	TIDOC_tfidf-se.csv
Appcelerator	Titanium	TIMOB	TIMOB_deeep-se.csv	TIMOB_tfidf-se.csv
Atlassian	Clover	CLOV	CLOV_deeep-se.csv	CLOV_tfidf-se.csv
Atlassian	Confluence Cloud	CONFCLOUD	CONFCLOUD_deeep-se.csv	CONFCLOUD_tfidf-se.csv
Atlassian	Confluence Server and Data Center	CONFSERVER	CONFSERVER_deeep-se.csv	CONFSERVER_tfidf-se.csv
DNNSoftware	DNN	DNN	DNN_deeep-se.csv	DNN_tfidf-se.csv
Duraspace	Duracloud	DURACLOUD	DURACLOUD_deeep-se.csv	DURACLOUD_tfidf-se.csv
Hyperledger	Fabric	FAB	FAB_deeep-se.csv	FAB_tfidf-se.csv
Hyperledger	Sawtooth	STL	STL_deeep-se.csv	STL_tfidf-se.csv
Lsstcorp	Data management	DM	DM_deeep-se.csv	DM_tfidf-se.csv
MongoDB	Compass	COMPASS	COMPASS_deeep-se.csv	COMPASS_tfidf-se.csv
MongoDB	Core Server	SERVER	SERVER_deeep-se.csv	SERVER_tfidf-se.csv
MongoDB	Evergreen	EVG	EVG_deeep-se.csv	EVG_tfidf-se.csv
Moodle	Moodle	MDL	MDL_deeep-se.csv	MDL_tfidf-se.csv
Mulesoft	Mule	MULE	MULE_deeep-se.csv	MULE_tfidf-se.csv
Sonatype	Sonatype’s Nexus	NEXUS	NEXUS_deeep-se.csv	NEXUS_tfidf-se.csv
Spring	Spring XD	XD	XD_deeep-se.csv	XD_tfidf-se.csv

Content of the files

Each csv file for Deep-SE approach contains 4 columns: issuekey, created, title, description, and storypoint.
Each csv file for TF/IDF-SE approach contains more than 4 columns: starting with issuekey, created, storypoint, context, codesnippet, and a set of one-hot columns for issue type (header starting with t_) followed by component(s) (header starting with c_).
The issues are sorted based on issue's creation time (i.e. the former issues was created before the latter issues).

[1] Vali Tawosi, Rebecca Moussa, and Federica Sarro. "Agile Effort Estimation: Have We Solved the Problem Yet? Insights From A Replication Study." IEEE Transactions on Software Engineering, no. TBA (2022): pp. TBA.