Conference paper

Removing spam from web corpora through supervised learning using FastText

Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier recognized 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.

Language
English

Subject
Corpus <Linguistics>
Internet
Text technology
Data management
Language

Event
Intellectual creation
(who)
Suchomel, Vít
Event
Publication
(who)
Mannheim : Institut für Deutsche Sprache
(when)
2017-07-06

URN
urn:nbn:de:bsz:mh39-62674
Last update
06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.

Object type

  • Conference paper

Associated

  • Suchomel, Vít
  • Mannheim : Institut für Deutsche Sprache

Time of origin

  • 2017-07-06