Conference paper

Removing spam from web corpora through supervised learning using FastText

Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
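
The abstract reports results in terms of precision and recall, and the 71% figure reads most naturally as recall on spam documents. As a minimal sketch of how these two measures are computed for a binary spam classifier, the following example uses hypothetical toy labels; the function name, label strings, and data are assumptions for illustration and are not taken from the paper.

```python
def precision_recall(gold, predicted, positive="spam"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    # True positives: spam documents correctly flagged as spam.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: non-spam documents wrongly flagged as spam.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: spam documents the classifier missed.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels, not data from the paper.
gold = ["spam", "spam", "spam", "spam", "ok", "ok", "ok", "ok"]
predicted = ["spam", "spam", "spam", "ok", "spam", "ok", "ok", "ok"]

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the classifier finds three of the four spam documents and raises one false alarm, so both measures come out at 0.75. Low precision means good texts are wrongly discarded from the corpus; low recall means spam remains in it.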

Author: Suchomel, Vít

Attribution-NonCommercial-NoDerivatives 4.0 International

Language: English

Subject: Corpus <linguistics>
Internet
Text technology
Data management
Language

Event: Intellectual creation

(who): Suchomel, Vít

Event: Publication

(who): Mannheim : Institut für Deutsche Sprache

(when): 2017-07-06

URN: urn:nbn:de:bsz:mh39-62674

Last update: 06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.

