PA153: Stylometric analysis of texts using machine learning techniques
Jan Rygl, rygl@fi.muni.cz
NLP Centre, Faculty of Informatics, Masaryk University
Dec 7, 2016

Stylometry
Stylometry is the application of the study of linguistic style.
Study of linguistic style:
• Find out text features.
• Define the author's writeprint.
Applications:
• Determine the author's category (person, nationality, age group, ...).
• Filter out text features not usable by the selected application.

Examples of applications
• Authorship recognition
  • legal documents (verify the author of a last will)
  • false reviews (cluster accounts by real authors)
  • public security (find authors of anonymous illegal documents and threats)
  • school essays authorship verification (co-authorship)
  • supportive authentication, biometrics (e-learning)
• Age detection (pedophile recognition on children's web sites)
• Author's mother language prediction (public security)
• Mental disease symptom detection (health prevention)
• HR applications (infer personality traits from text)
• Automatic translation recognition

Stylometry analysis techniques
1. ideological and thematic analysis: historical documents, literature
2. documentary and factual evidence: inquisition in the Middle Ages, libraries
3. language and stylistic analysis:
   • manual (legal, public security and literary applications)
   • semi-automatic (same as above)
   • automatic (false reviews and generally all online stylometry applications)

Stylometry: Verification
Definition
• Decide whether two documents were written by the same author (category 1v1).
• Decide whether a document was written by the signed author (category 1vN).
Examples
• The Shakespeare authorship question
• The verification of wills
Mendenhall, T. C. 1887. The Characteristic Curves of Composition. Science, Vol. 9: 237-249.
• The first algorithmic analysis.
• Calculating and comparing histograms of word lengths (Shakespeare vs. candidates Oxford, Bacon, Derby, Marlowe); a small code sketch of this idea follows at the end of this section.
  http://en.wikipedia.org/wiki/File:ShakespeareCandidates1.jpg

Stylometry: Attribution
Definition
• Find the author category of a document.
• Candidate authors' categories can be known (e.g. age groups, healthy/unhealthy person).
• Problems with unknown candidate author categories are hard (e.g. online authorship, all clustering tasks).
Examples
• Anonymous e-mails
• Judiciary: the police falsify testimonies
  Morton, A. Q. Word Detective Proves the Bard wasn't Bacon. Observer, 1976.
  Evidence in courts of law in Britain, the U.S. and Australia.
  Expert analysis of courtroom discourse, e.g. testing "patterns of deceit" hypotheses.

NLP Centre stylometry research
Authorship Recognition Tool
• Ministry of the Interior of the Czech Republic, project VF20102014003.
• Best security research award from the Minister of the Interior.
Small projects (bachelor and diploma theses, papers)
• detection of automatic translation, gender detection, ...
Text Miner
• multilingual stylometry tool + many other features not related to stylometry
• authorship, mother language, age, gender and social group detection
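The following is a minimal, illustrative Python sketch of the Mendenhall word-length histogram comparison mentioned above. The whitespace tokenization, the L1 distance and the file names are assumptions made only for this example; they are not part of the original 1887 study or of any tool presented in these slides.

    from collections import Counter

    def word_length_histogram(text, max_len=15):
        """Normalized frequencies of word lengths 1..max_len (longer words fall into the last bin)."""
        lengths = [min(len(token), max_len) for token in text.split()]
        counts = Counter(lengths)
        total = float(len(lengths) or 1)
        return [counts.get(i, 0) / total for i in range(1, max_len + 1)]

    def histogram_distance(hist_a, hist_b):
        """L1 distance between two word-length histograms (smaller = more similar)."""
        return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

    # Hypothetical usage: compare a disputed text with a sample by a known author.
    with open('known_author.txt') as f_known, open('disputed.txt') as f_disputed:
        print(histogram_distance(word_length_histogram(f_known.read()),
                                 word_length_histogram(f_disputed.read())))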
Updated definition
Stylometry: techniques that allow us to find out information about the authors of texts on the basis of an automatic linguistic analysis.

Stylometry process steps
1. data acquisition: obtain and preprocess data
2. feature extraction methods: get features from texts
3. machine learning: train and tune classifiers
4. interpretation of results: make machine learning reasoning readable by humans

Corpora
• Enron e-mail corpus
• Blog corpus (Koppel, M.: Effects of Age and Gender on Blogging)
Manually annotated corpora
• UCNK school essays

Techniques: Data acquisition and preprocessing
Tokenization, morphological annotation and disambiguation:
• morphological analysis assigns each token a lemma and a tag, e.g. for the Czech phrase "spor mezi Severem a Jihem" ("a dispute between the North and the South"):

    word       lemma   tag
    spor       spor    k1gInSc1
    mezi       mezi    k7c7
    Severem    sever   k1gInSc7
    a          a       k8xC
    Jihem      jih     k1gInSc7

• verb forms are lemmatized to the infinitive (je → být, jde → jít; tag k5eAaImIp3nS) and punctuation receives the tag kIx.

Techniques: Selection of feature extraction methods
Categories
• Morphological
• Syntactic
• Vocabulary
• Other
Analyse the problem and select only suitable features. Combine with automatic feature selection techniques (entropy).

Techniques: Tuning of feature extraction methods
Tuning process: divide the data into three independent sets:
• Tuning set (generate stopwords, part-of-speech n-grams, ...)
• Training set (train a classifier)
• Test set (evaluate a classifier)

Techniques: Features examples
Word length statistics
• Count and normalize frequencies of selected word lengths (e.g. 1-15 characters).
• Modification: word-length frequencies are influenced by adjacent frequencies in the histogram, e.g. {1: 30 %, 2: 70 %, 3: 0 %} is more similar to {1: 70 %, 2: 30 %, 3: 0 %} than to {1: 0 %, 2: 60 %, 3: 40 %}.
Sentence length statistics
• Count and normalize frequencies of
  • words per sentence,
  • characters per sentence.

Techniques: Features examples
Stopwords
• Count the normalized frequency of each word from a stopword list.
• A stopword is a general word whose semantic meaning is not important, e.g. prepositions, conjunctions, ...
• The stopwords ten, by, člověk, že are the most frequent in the selected five texts of Karel Čapek.
Wordclass (bigram) statistics
• Count and normalize frequencies of wordclasses (or wordclass bigrams).
• A verb is followed by a noun with roughly the same frequency in the selected five texts of Karel Čapek.
Morphological tag statistics
• Count and normalize frequencies of selected morphological tags.
• The most consistent frequencies across the selected five texts of Karel Čapek are those of the gender (genus) and archaism tags.
Word repetition
• Analyse which words or wordclasses are frequently repeated within a sentence.
• Nouns, verbs and pronouns are the most repetitive in the selected five texts of Karel Čapek.

Techniques: Features examples
Syntactic features
• Extract features using SET (Syntactic Engineering Tool).
• [Figure: syntactic tree of the Czech sentence "Verifikujeme autorství se syntaktickou analýzou" ("We verify authorship with syntactic analysis").]
• Syntactic trees have a similar depth in the selected five texts of Karel Čapek.

Techniques: Features examples
Other stylometric features
• typography (number of dots, spaces, emoticons, ...)
• errors
• vocabulary richness

Techniques: Features examples
Implementation (wordclass frequency features over morphological tags):

    import numpy as np

    # Wordclass features: the first two characters of each morphological tag
    # (e.g. 'k1' noun, 'k5' verb) are counted and normalized per document.
    features = (u'kA', u'kY', u'kI', u'k?', u'k0', u'k1', u'k2', u'k3',
                u'k4', u'k5', u'k6', u'k7', u'k8', u'k9')

    def document_to_features(self, document):
        """Transform document to tuple of float features.

        @return: tuple of n float feature values, n = |get_features|
        """
        features = np.zeros(self.features_count)
        sentences = self.get_structure(document, mode='tag')
        for sentence in sentences:
            for tag in sentence:
                if tag and tag[0] == u'k':
                    key = self.tag_to_index.get(tag[:2])
                    if key is not None:      # index 0 is a valid feature
                        features[key] += 1.0
        total = np.sum(features)
        if total > 0:
            return features / total          # normalize to relative frequencies
        else:
            return features
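For comparison with the implementation above, here is a self-contained sketch of the stopword feature described earlier in this section. The short Czech stopword list and the plain whitespace tokenization are simplifying assumptions for illustration only; the real tool would use a longer, tuned stopword list and proper tokenization.

    import numpy as np

    # Illustrative stopword list (sample only, not the list used in the tool).
    STOPWORDS = (u'ten', u'by', u'člověk', u'že', u'a', u'se')

    def stopword_features(document):
        """Return the normalized frequency of each stopword in the document."""
        counts = np.zeros(len(STOPWORDS))
        tokens = [token.lower() for token in document.split()]
        for i, stopword in enumerate(STOPWORDS):
            counts[i] = tokens.count(stopword)
        total = len(tokens)
        return counts / total if total > 0 else counts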
Tools
• Use frameworks instead of your own implementation (ML is hardware-consuming and needs to be optimized).
• The programming language does not matter much, but high-level languages can be better: readability is important and performance is not affected, since ML frameworks usually use C libraries.
• For Python, a good choice is scikit-learn (http://scikit-learn.org).

Machine learning tuning
• Try different machine learning techniques (Support Vector Machines, Random Forests, Neural Networks).
• Use grid search, random search or other heuristic searches to find optimal parameters (use cross-validation on the training data).
• But start with fast and easy-to-configure techniques (Naive Bayes, Decision Trees).
• Use feature selection (more features is not always better).
• Make experiments replicable (use a random seed) and repeat them with different seeds to check their performance.
• Always implement a baseline algorithm (random answer, constant answer).
(A small scikit-learn sketch of this workflow is appended after the last slide.)

Techniques: Machine learning tricks
Replace feature values by the ranking of feature values.
• Different "document conditions" are considered: a book is a long coherent text, a blog a medium-length text, an e-mail a short noisy text.
• Attribution: replace similarity by the ranking of the author against the other authors.
• Verification: select random similar documents from the corpus and replace similarity by the ranking of the document against these selected documents.

Techniques: Interpretation of results
Explanation of ML reasoning can be important. We can:
1. not interpret the data at all (then we cannot enforce any consequences),
2. use one classifier per feature category and use the per-category results as a partially human-readable solution,
3. use ML techniques which can be interpreted:
   • linear classifiers: each feature f has a weight w(f) and a document value val(f); the decision is Σ_{f∈F} w(f) · val(f) > threshold,
   • extensions of black-box classifiers, e.g. for random forests: https://github.com/janrygl/treeinterpreter,
4. use another statistical module not connected to ML at all.

Performance (Czech texts)
Balanced accuracy: current (CS) → desired (EN).
Verification (books, essays, newspapers, blogs, letters, e-mails, discussions, SMS):
• books, essays: 95 % → 99 %
• blogs, articles: 70 % → 90 %
Attribution (depends on the number of candidates; comparison on blogs):
• up to 4 candidates: 80 % → 95 %
• up to 100 candidates: 40 % → 60 %
Clustering:
• the evaluation metric depends on the scenario (50-60 %)

Results: I want to try it myself
How to start
• Select a problem.
• Collect data (gender detection data are easy to find: crawl a dating service).
• Preprocess the texts (remove HTML, tokenize).
• Write a few feature extraction methods.
• Use an ML framework to classify the data.

Results: I want to try it really quickly
Quick start: Style & Identity Recognizer, https://github.com/janrygl/sir
• In development, but functional.
• Contains data from dating services.
• Contains feature extractors.
• Uses the free RFTagger for morphological tagging.

Results: Development at FI
Text Miner
• more languages,
• more feature extractors,
• more machine learning experiments,
• better visualization,
• and much more.

Thank you for your attention.
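Appendix: a minimal sketch of the machine learning tuning workflow referenced in the "Machine learning tuning" slide (baseline classifier, fixed random seed, cross-validated grid search), using scikit-learn. The SVM parameter grid and the placeholder feature/label files are assumptions for illustration, not the setup of the reported experiments.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # X: feature matrix produced by the feature extraction step, y: author labels.
    # The .npy file names are placeholders for whatever the experiment produced.
    X, y = np.load('features.npy'), np.load('labels.npy')

    # Fixed random seed so the experiment is replicable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Baseline: always answer the most frequent class.
    baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    print('baseline accuracy:', baseline.score(X_test, y_test))

    # Cross-validated grid search over SVM parameters on the training data only.
    grid = GridSearchCV(SVC(random_state=42),
                        {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
                        cv=5)
    grid.fit(X_train, y_train)
    print('best parameters:', grid.best_params_)
    print('test accuracy:', grid.score(X_test, y_test))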