Conference paper

Removing spam from web corpora through supervised learning using FastText

Unlike traditional text corpora, which are collected from trustworthy sources, web-based corpora have to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier recognized 71% of web spam documents similar to the training set, but lacked both precision and recall when applied to short texts from another data set.

Author: Suchomel, Vít

Attribution - NonCommercial - NoDerivatives 4.0 International

Language: English

Subject: Corpus <linguistics>
Internet
Text technology
Data management
Language

Event: Intellectual creation

(who): Suchomel, Vít

Event: Publication

(who): Mannheim : Institut für Deutsche Sprache

(when): 2017-07-06

URN: urn:nbn:de:bsz:mh39-62674

Last updated: 2025-03-06, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have questions about the object, please contact the data partner.
