When trying to hone your natural language processing (NLP) skills, finding accessible and relevant datasets can be one of the biggest bottlenecks. A lot of time can be spent trying to find existing datasets for the task at hand, or trying to curate your own data instead. It would be nice to have a centralized listing of accessible NLP datasets… wouldn't it?
This is where The Big Bad NLP Database (BBNLPDB), managed by Quantum Stat, comes in. If you are looking for datasets to work on your NLP skills, you should definitely check it out.
BBNLPDB provides access to nearly 300 well-organized, sortable, and searchable natural language processing datasets.
Here you can find datasets ready to go for common NLP tasks and needs, such as document classification, question answering, automated image captioning, dialog, clustering, intent classification, language modeling, machine translation, text corpora, and more.
One downside is that many of the datasets are in English, though a number of Arabic, Chinese, German, Dutch, and various Indian language entries do exist, as do numerous multilingual datasets.
Do you have a dataset that isn't included in the listing? Let them know, and they might add it.
Before anyone says it: sure, conveniently accessible and well-thought-out datasets are not representative of the real world. But that's not a concern when you are working on fine-tuning your technical skills. Standard datasets are also great for benchmarking, and the usual sets for all the various types of common tasks are available in the BBNLPDB as well.
Check out the BBNLPDB yourself if you are in the market for NLP datasets. At the very least, you might find a centralized location for accessing some of the more common and frequently used sets in the field.