Natural Language Processing & Textual Analysis in Finance & Accounting

So this is the key here: the key is having a co-author who has the skills, and I don't have those skills. The person who has the skills in natural language processing and textual analysis in finance and accounting is Bill McDonald. Bill will give the talk, and periodically I may add a comment, but Bill is really the workhorse for all of our papers in textual analysis. So let me introduce Bill. What Tim has politely said is that he supplies the ideas and I run the programs. Thank you for attending; I don't have a clicker here, so bear with me. Tim and I got into this five years ago or so, and the first time we presented a paper at Notre Dame's research seminar, I started with this slide. I'll try to stand still. After doing the presentation, it dawned on me that there were really two audiences in the room at our Notre Dame seminar series. The junior faculty already knew what textual analysis was, and about half of them were doing it, but they had no idea where this quote came from. The senior faculty had no clue what textual analysis was, but they all knew where the quote came from: Led Zeppelin, of course. Tim was the only senior faculty member who thought it was Harold Smith or somebody like that. So the purpose of this presentation is really to address the senior faculty group more than anything else. If you have a PhD in computational linguistics, I can't answer your questions, and neither can Tim; you've come to the wrong room. We're just going to walk you through the issues we've stumbled into as we've tried to learn about textual analysis. The first thing I struggled with, and kind of chuckled at as I watched papers cross my desk on this topic, is what exactly it is we're talking about. It's a relatively new topic, so we don't completely have the taxonomy
down as to what we're going to call this. It's developed in sociology, in psychology, and somewhat separately in computer science and political science; everybody's developed this on their own to some extent over the past 50 years, in some cases longer. So you'll see it called textual analysis. A lot of what I did when I first tried to learn what this is all about was read up on natural language processing, which usually comes more out of engineering. Textual analysis, sentiment analysis, and content analysis are most closely related; you see those coming more out of the social sciences. Natural language processing and computational linguistics you'd more commonly see together. Our profession seems to be attached to "textual analysis," and "sentiment analysis" seems to be popular, but I'm not exactly sure what to call what we're doing here. What we've seen, or at least what I perceive, is that interest in this area has increased substantially in the past, say, five years, and not just in finance but in other areas as well. Why is that? The notion of textual analysis has been around, I think, for hundreds of years, and a lot of it goes back to political science and the analysis of political speeches: looking for keywords and how often they're used. So the notion of doing this type of content analysis has been around for a very long time, and with computers coming online in the 50s, people saw that there might be a lot of potential in this area and a lot of interesting outcomes. What happened was that the initial results weren't as promising. A lot of grant money went into this topic early on, in the 50s and 60s, and then a lot of it was pulled back out, because researchers really weren't getting much in terms of interesting results, probably because they were trying to do something like artificial intelligence, and
that's very hard to do on big mainframe computers. So the topic really wasn't all that fashionable for quite some period of time, but in the past five years it has certainly taken off. Why? We have the computers to do it now. You need lots of storage, you need lots of CPU, and we can now do a lot of it on a reasonably dependable desktop machine.

More importantly, if you look at when they first started doing this, there just wasn't a lot of digitized text out there, so even if you had all these tools, what were you going to do with them? Now there are large bodies, corpora, which is what you tend to call them in this business, of digitized text. We tend to lean on the SEC's EDGAR site a lot, but the whole internet is just a bunch of digitized text. What makes it a little more interesting is that a lot of new technologies are being developed in this area through search engines. The folks at Google and Microsoft and Yahoo spend a lot of time on questions like: if you're searching for one particular topic, how do I determine whether this document is relevant, even though it may not use the specific words you used? They're doing things like latent semantic indexing to come up with those kinds of solutions. So I think that's why it's become a much more popular topic in the past five years or so. Okay, I'll talk about data and programs next, and I would encourage you, as we're going through this, to please feel free to jump in with questions and comments. As Tim already indicated, I've spent a lot of time playing with EDGAR. When I first started down this path, it was because, as you can tell by the color of my hair, I've been in this business for a while. I'd been a standard faculty member using Fortran to pound on Compustat and CRSP for a long period of time, and I just lost interest in that at some point. I thought: there's this huge amount of qualitative text out there, what can we do with that? I started playing around, and EDGAR just seemed like an easy target. If you have some relationship with the Wall Street Journal folks, you can actually get access to their news archive, which is digitized, XML-encapsulated articles, all the articles that went across the newswire from
I think 2004 forward. I have that data set, though we haven't worked with it a lot. You see people now doing work on conference-call transcripts, where they're not just working with the text: if you follow this literature, some people are actually looking at the audio fingerprints to ask whether you are under stress when you're talking, those types of things. Websites certainly have text. Some of our colleagues at Notre Dame are looking at Google searches as a measure of interest in particular securities, which is very fashionable now. And of course there's StockTwits: if you're going to tweet about a stock, you put the ticker symbol preceded by a dollar sign, so I can go through a stream of tweets very quickly and say, here's somebody talking about Google, and try to do some sentiment analysis on what they're saying in that particular context. Those are some examples; I'm sure there are many more, but there's a lot more digitized text out there than there was 10 years ago. Okay, programs: how do you approach this? Again, for the senior faculty in our audience, there used to be a bunch of Fortran programs, then SAS and Stata, and that was about it. Much of this involves a different approach in terms of what you're doing computationally. You can get black-box solutions, but if there's any single theme, any single conclusion that Tim and I have come to in all the work we've done, it's this: don't use off-the-shelf technology, whether it be word lists, programs, or whatever. Why? The results are many times driven by outliers; we'll discuss the specifics of that later on. If you don't know exactly what's going into your recipe, you'll have no idea what's coming out. You need to know what's going on under the hood. So don't use a program like some of those that are available, where they say: I've got a packaged program, you give me a quantity of text, a bunch of documents, I feed it into this black
box, and it gives you back measures of sentiment on everything from how much you love your mother to anxiety. If you just run with those results and I'm reviewing the paper, you're in a lot of trouble. You've got to know what's going on under the engine. If you're doing this on your own, if you're programming, you need a programming language that, and this doesn't have to be all in one package, but it's nice, can go out and get the data, because you're downloading many files. So if you want to

download all the 10-Q, 10-K, and all the variants from EDGAR, there are more than six hundred thousand files, so you don't want your RA doing that over and over and over. As Tim's RA, I want to make sure I have a program that can go through and do this automatically. So you want some kind of language that gives you access to the internet and the ability to download a data set and possibly convert it into a string variable, although in most cases, since I know I'm going to have to go back through and re-parse, I'm usually just copying the files and storing them locally. You also need the ability to parse; we'll talk more about the specifics of that later on, and those of you who have done this know exactly what I'm talking about. Could you do this in a traditional language that didn't have access to some kind of parsing engine? Yes, but it would take a lot of work. It's not impossible, but it's close to impossible, and it's very easy once you have the right tools; most modern programming languages have access to this type of technology. I think the choice of programming language is a matter of religion, so I can't tell you this is the one you should use. I've programmed a lot, and I know it really is a matter of looking at what your skill set is, what language you've worked a lot with, and which language you can convert to easily. If you were doing this 15 or 20 years ago, Perl was one of the languages that was initially most amenable to textual analysis, so you'll hear a lot of folks say: oh, you're going to be working with text and you haven't done that before, you need to be using Perl. That was true many years ago; now, as long as you have a reasonably modern language with access to some of the tools we'll talk about, it doesn't really matter. My friends in computer science will say, yeah, Perl is good, but Python is really the language I should be teaching my children, and it solves these
problems also. So I've heard some people swear by Python. I've worked with both; they're both good languages; it's back to whatever your religion is. SAS has its own text miner; I've never used it, but I'm sure it's a very powerful tool. Any time you're using a prepackaged tool there's some overhead, so there are pros and cons, but I'm sure it's a wonderful tool, and I think it interacts well with the system they have set up, so it's just something else to consider. And as someone who's programmed a lot, in assembly language on a mainframe, in C++, you name it, I've programmed in a lot of different languages, I'm embarrassed to admit that I solved this problem using Visual Basic. That's like programming for dummies: when I taught computer programming, we would teach that language to freshmen because it's real easy. Why do I do that, since I can't pound my chest and say I'm doing it in F# or C++ or whatever? Because I can solve problems quickly in this language. I like the Visual Studio development environment, and I know you can use many languages in Visual Studio, but I can just solve problems quickly that way. So when I decided to pick a language, again, it's a matter of religion; I'm not recommending that language, but it's a pretty simple solution, and I think if I were trying to convert some of my colleagues, the old folks in Fortran, it would be the easiest language to move them to. One of the tools that's critically important in going through and analyzing text is a parsing engine, which is typically driven by regular expressions, so you'll hear people talking about regexes. This is a language within a language that you essentially have to learn; it allows you to look for patterns in large quantities of text. When you're in Microsoft Word or something and it lets you go through and do some things
very quickly with text, it's driven by this type of engine in many ways. Let me give you a simple but complex example; I actually just put this into the presentation to have an example, and I haven't gone back and checked it carefully. And let me point out that as you're doing these things, this whole area of textual analysis is very imprecise. When I started working in this area, it took me a long time to get over that fact: there's a lot of ambiguity when you're dealing with language. If we're talking about regression analysis, you invert a matrix and I invert a matrix, and we'll get the same answer to within eight decimal places if we do it right. Doing this stuff, you tune regular expressions, you tune a lot of your algorithms, and you try to get them up to the point where you're comfortable with

them, but you never get exactly the right answer, and what's important is that you tune them to your application; that gets back to not using the black box. So if you've never worked with regular expressions, let me give you an example of what they look like. This is all one line, a single regular expression; you can feed it a document that's essentially stored as a string and say: give me a vector of all the matches to what I'm looking for. And what am I looking for? This attempts to identify sentences, as I define them. Look at the operator within the parentheses: I know this is a look-behind because of the question mark and equals sign, which say, I have a target, this is what I'm interested in, and the reason I'm interested in it is because it's preceded by this stuff. Okay, so what's this stuff? Well, think about a sentence in a document. How do I know it's a sentence? It could be the very first thing in the document, in which case there's nothing preceding it, and I don't want to miss it; that's the beginning-of-string character. The vertical bar means "or"; anything that's an operator within the language has to have an escape character to be treated literally, so within the brackets, which mean "any of these characters," they have escape characters. What am I saying here? Any sentence-ending punctuation: either it's the beginning of the string, or there's a period a couple of spaces back. So: a period, exclamation point, or question mark, then one or more spaces, that's the plus sign; the backslash-s stands for white space. One or more spaces, and I'll call what follows a sentence. Or, if I see line feeds: this tells me two or more, I could have said two to four, but this is two or more line feeds. So if I've had two line feeds and I see a capital letter and a bunch of characters, I'm going to say that's probably a sentence. Notice this is a matter of
opinion, of how you choose to define a sentence. So that's my look-behind. What am I looking for? Something that starts with a capital A through Z. If you start a sentence with a number, and I don't know if that's technically incorrect or not, people have debated it, I won't call it a sentence; I'm not going to be a hard-liner about it, but that's what I settled on. So: a capital A through Z, followed by any character except, this is saying "not these characters," a punctuation mark or a line feed. And we have, for our own purposes, defined a sentence as 20 characters or more; you tune that within a particular context. In our case, within the context of, say, 10-Ks, if you look at shorter strings you're more than likely capturing headings, not sentences, so we somewhat arbitrarily, but not completely arbitrarily, said we need 20 characters. Then this last piece is a look-ahead: when I see 20 or more characters starting with a capital letter, preceded by the stuff above, it has to be followed by a period, exclamation point, or question mark, the end of the sentence, followed by a space; or it could be the very last sentence in the document, in which case it ends at the last character of the string. So that's identifying sentences. And when people ask what you do with regular expressions: as I indicated, you give me whatever document you'd like, I feed it into this regex engine. I might label it rx and define it, then I say rx.Matches, and it feeds me back a match collection, which is a vector of everything that matches the pattern I've asked for. That's the power behind how you do textual analysis; that's the tractor pulling the plow. And you have to learn how to do these things yourself; it's back to not just pulling stuff off the internet.
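To make the idea concrete, here is a simplified rendition of that sentence-identification pattern in Python's re module. This is a sketch of the logic just described, not our actual expression (which is written in Visual Basic and tuned further), so treat every detail as illustrative:

```python
import re

# A sentence, roughly as defined above: the target starts with a capital
# letter and runs at least 20 characters; it must be preceded by the start
# of the string, by sentence-ending punctuation plus spaces, or by two or
# more line feeds; and it must be followed by terminal punctuation or the
# end of the string.
SENTENCE_RE = re.compile(
    r"(?:^|(?<=[.!?])\s+|(?<=\n\n))"  # what must precede the sentence
    r"([A-Z][^.!?\n]{19,}?)"          # capital letter plus 19+ more characters
    r"(?=[.!?](?:\s|$)|$)"            # what must follow it
)

def find_sentences(text):
    """Return the list of substrings the pattern calls sentences."""
    return [m.group(1) for m in SENTENCE_RE.finditer(text)]
```

Note how the 20-character floor silently drops short headings, exactly the tuning decision discussed above; changing `{19,}` changes what gets called a sentence.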
You can go on the internet and say, give me a regex for finding abbreviations, and you'll find something that works maybe 60% of the time. You've got to sit down and think carefully about how you're applying it and where some of these rules may not work. Being in finance, I spent a lot of time trying to read through some of the literature in the area of natural language processing, and after spending some time with it I stumbled onto this quote, which I think summarizes the area really quite well:

"Natural languages are messy and difficult to parse with computers." It sounds like a very simplistic statement, but it was said by somebody who's really an expert in the area, and I think it summarizes the field quite well, so I had to throw that slide in. The other thing we've learned from experience is that there are a lot of tripwires, and when you make mistakes in content analysis, you tend to make really big mistakes. I know we've made them. What Tim is really good at, and he may not admit to being good at many things, but what he's really good at, is being very careful. I'll go through and crank out data very quickly, and then he'll go through and say, Bill, this doesn't make sense at all, and usually, always, he's right. So there are a few things, and there are a thousand of them, that make this more of an art than a science. Here are some examples. If we submit a paper that looks at 10-Ks, one of the first comments we'll always get is: well, why don't you break it down into its components? As if we didn't know there were components to a 10-K. Usually they mean the Management's Discussion and Analysis: instead of focusing on the entire document, focus on what management's talking about. We've looked at other papers that have done that, and we've actually done it too, because we were told to. But think about what you have to do there. You say, well, okay, I can use the table of contents to do that; but a lot of very large firms don't have a table of contents in their 10-K, so you're missing those. Many do, but many don't. And if you go back in time, it really kind of fades in: the SEC side, if you're using EDGAR, starts in 1994 with a fair number of filings, 1995 with a fair number, then everybody is required to file electronically in 1996; but in terms of the HTML and the demarcation of the text, that's not really consistent until about 2002. Going forward from there
you see it get a little bit better. So if you want to use any of the old data, you can't rely on the HTML or whatever to tell you what's going on; or else you eliminate a lot of your sample, or you bias it toward very large firms, because the larger firms tend to be the ones using that structure. The other thing is the little things. What are you looking for? You're looking for Item 7, Management's Discussion and Analysis. Did you account for the fact that many times "management's" is misspelled, which we found? Did you account for the fact that many times it's listed as Item 6, which it's not supposed to be, they just screwed up? And even if you get that right, and this is back to making errors: so you found Item 7, hallelujah; you've also got to find Item 8 exactly right, and if you don't, you're going to call everything past Item 8 part of the MD&A, and it's going to be an outlier in your sample. It's back to: when you screw up in this, you screw up big, and it will affect your results. That's why I'm really reluctant to get into breaking these documents apart. I've worked a lot with parsing out segments of 10-Ks; I can get it about 90-some-odd percent accurate, but it's those last few percent that scare me to death. Someone asked: when you get down to the final five percent, each with some quirk, do you almost have to write a line of code for every exception? Yes, at some point; it's hard to write an algorithm that captures everything. Now, if you've looked at any of the word lists we have, we have one that Tim calls the weasel words, and it will be interesting to see if he can get that into a publication someplace; we call them the modal weak words, which sounds a little nicer. So if you say "I might," "I could possibly do something," those are weasel words, and they're interesting things to look at in financial documents. Actually, we found that they do tend to do two things. Well, one of the
weasel words is "may." Well, the first time we did this, Tim, in his thoroughness, discovered there's a tremendous seasonality in it: firms filing in the fifth month of the year used it more than the others. We weren't able to publish that result. So what do you learn? You learn: oh geez, before I run my regex, I should go through and take out every capitalized "May." Now, this gets back to type 1 and type 2 errors, so, wait a second:

you could start a sentence with "May," but I'm betting you won't in a financial document, because if I start trying to figure out whether this is the beginning of a sentence or the month, I'm more likely to make a mistake; I'm more likely to call the month a word when it isn't. So I just go through and take out all the capitalized "May"s, and then I go through and look for "may" again. That's the kind of quirk that could drive your results: if you're looking for modal words, a lot of firms, in the first instance any firm filing in a particular period, will spike up on this one word, and it will drive your results; the same problem arises with fiscal year ends in May. So that's not so much a tripwire as my learning curve. One of the things you run into is that you have to parse sentences, which is much harder than just parsing for words, since sentence boundaries are all based on punctuation. One of the first things you've got to do is get all the abbreviations out of there, and I thought, well, that's not one line of code, but hey, five or ten lines of code and I'll have this knocked out. Then you go to the computational linguistics literature and you start reading publications, in their best journals, that are very long, debating how to come up with the optimal algorithm to do this, and you realize, hey, this really isn't very easy. A lot of things you start out thinking are simple, and if you come up with simple solutions, it's back to: you're probably making major mistakes. And I've already indicated this tripwire: if you're doing this on filings, especially pre-2002, they tend not to be very structured, and the ones where you're going to screw up, if you're assuming some kind of structure in the document, tend to be the smaller companies, so you're going to have a small-firm bias. So those are some tripwires.
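As a concrete sketch of that "May" fix, in Python here (the function names are mine, purely for illustration): the order of operations matters, strip the capitalized month first, then count the lower-case modal word.

```python
import re

# Heuristic from the talk: in a filing, a capitalized "May" is almost always
# the month, while the modal verb is almost always lower case, so we accept
# the rare loss of a sentence-initial modal "May" to avoid counting dates.
MONTH_MAY = re.compile(r"\bMay\b")
MODAL_MAY = re.compile(r"\bmay\b")

def count_modal_may(text):
    cleaned = MONTH_MAY.sub(" ", text)   # remove the month before counting
    return len(MODAL_MAY.findall(cleaned))
```

This deliberately trades a small type 2 error (missing a modal "May" that starts a sentence) for a much smaller type 1 error (counting the month as a weasel word).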
Tim suggested I go through a simple application, and again we're leaning a little bit on SEC EDGAR applications; it's by no means limited to that. We have looked at other things, we are doing other things, many people are doing other things; it's just that this is where we started, where we're probably most comfortable, and there's certainly a ton of information there. So how might I go about analyzing what I'll just label a 10-X? Because the first thing you have to do is say, okay, I want 10-Ks; well, there are 10-Ks and there are 10-K405s, and they're all basically the same thing, so you have to be sure to identify all those 10-Ks. There's the 10-KSB for small business. Do you want to include 10-Qs? In which case you also have to get the 10-QSB, and so on and so forth. So let's just call it a 10-X. What would I do first? First I go to the back end of the SEC site. I've downloaded probably terabytes of data from the SEC, and I guess it's your tax dollars at work; they haven't called me. I've actually interacted with the technologists there, and I keep waiting for them to contact me and say, we're going to block your IP, stop doing this. They've got great servers, they'll serve this information up, and it's available to everyone. So the first thing I do is figure out where these documents are, and the way you do that is to go into the master index at the SEC, which is at this website. If you'd like these slides, email me, I'm mcdonald.1 at nd.edu, and you can look me up on the internet any time; I'll also put them on my website, which we'll reference at the end. So, for every year and every quarter, and I start at 1994, quarter one, quarter two, and so on, computationally I'm obviously just going through some loops, I'm going to go in and pull down a file called master.idx, I believe, off the top of my head. It's a text file, and
this is just what that file looks like. I skip through the header and start reading it line by line. Each line contains the CIK, which is the identifier that EDGAR uses and which, unfortunately, is not directly tied to the identifiers on CRSP or Compustat, so you have to sort that matching out yourself. The file is pipe-delimited: it uses pipes to separate the CIK, the name of the company, the form type, the date of the filing, and the file name I need to look for. So suppose I'm pulling 10-Ks: I'm going to look in this file for the form types I'm interested in. For the fourth quarter of 2011, a big period, this file probably has about 200,000 lines, one per filing.
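Reading those records is a one-liner per field; here is a Python sketch with the field order just described (the function name and the exact set of target forms are my own choices, not the authors' code):

```python
# Form types we want to keep; extend with 10-Q variants as needed.
TARGET_FORMS = {"10-K", "10-K405", "10-KSB", "10-Q", "10-QSB"}

def parse_index_line(line):
    """Parse one pipe-delimited master.idx record:
    CIK|Company Name|Form Type|Date Filed|Filename.
    Returns the record as a tuple if it's a target form, else None."""
    parts = line.strip().split("|")
    if len(parts) != 5:
        return None                      # header lines, separators, junk
    cik, name, form, date, filename = parts
    if form not in TARGET_FORMS:
        return None
    return cik, name, form, date, filename
```

The `len(parts) != 5` check quietly skips the header block and divider lines at the top of each index file.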

So I've got to go through those 200,000 lines, flag all the ones that have the form type I'm interested in, and as soon as I see that form type, pull the file name. Okay: identify target forms. Then I immediately plug that target file name into this URL, so I'm just looping through, pulling down these files, and I store them locally so I don't have to do this again. Once I have them all stored locally, I've downloaded 600,000 10-X files, and I can pull each one up, obviously in a loop, and go through them and do my stuff. In this case let's assume we're just parsing for words: word counts, maybe some type of sentiment analysis. There are all kinds of other things you can do, I know that, but let's just assume we're parsing for words. For all these filings, you can go to the SEC site and pull up an HTML version of the filing, which is very nice and has all the pictures and everything, but everybody is also required to file the entire filing as a pure ASCII text file, and that's what I'm pulling down, because I want to know what's in those files. Now, with a pure text file, if a firm included a logo, a picture, in its filing, and it's not that many firms, it's not extremely common, but it certainly shows up, think about what happens when you convert pictures into ASCII-encoded text. If the document is this big, only this much of it may be text, and the ASCII-encoded picture may account for this much of the total document: the sliver is what you're interested in, and the rest is just pictures, graphics, PDFs that got shoved in there. Could you look at those? Yes, you could; it's just not something we've taken a specific interest in.
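The download step amounts to building a URL from the index's filename column and caching the result on disk. A hypothetical Python version (the EDGAR base URL is the one implied above; the path flattening is my own convention):

```python
import os
import urllib.request

ARCHIVE_BASE = "https://www.sec.gov/Archives/"

def filing_url(filename):
    """master.idx filename column -> full download URL."""
    return ARCHIVE_BASE + filename

def local_path(filename, outdir="filings"):
    """Flatten the EDGAR path into a single local file name."""
    return os.path.join(outdir, filename.replace("/", "_"))

def download_filing(filename, outdir="filings"):
    """Fetch a filing once; reparsing later never re-downloads."""
    os.makedirs(outdir, exist_ok=True)
    dest = local_path(filename, outdir)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(filing_url(filename), dest)
    return dest
```

Storing the raw files locally, rather than parsing on the fly, is exactly the "copy and store, reparse later" workflow described above.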
More recently you also have XBRL, which takes huge amounts of space and is focused on the tabular information. We take the tabular information from Compustat, so we're not going in and trying to parse that out ourselves. XBRL is eXtensible Business Reporting Language: the tags tell you what each number is, so if I give you a number of 1 billion, it has an XBRL tag saying this is in dollars, this is total assets, and so on; it's amazing the level of detail it gives you for each data item. We're not interested in that, so I pull out all the XBRL. We also remove tables; again, we're not interested in going through the financials. You could keep them; we're just not doing that. Now, removing tables sounds pretty simple, because even the less sophisticated firms will demarcate tables with the table tag, so you just write the regex for everything between the opening and closing tags. The problem, and this gets back to tripwires, is that a lot of less sophisticated firms use tables to define paragraphs: in their documents, everything that's a paragraph is just preceded and followed by a table indicator. So if you just go through and knock out tables, you're knocking out all the text; you can't do that. Instead, you bring in all the information between the table tags, you look at the number of alphabetic characters and the number of numeric characters, and you decide at what percentage, I think we're using around 20%, maybe a lower number, you're going to call it a real table. You'll get it right hopefully most of the time, but you won't get it right some percent of the time. Whether you do this at all kind of depends on the focus of your study: sometimes we do it where we think it matters, because tables contain a lot of repetitive words, "profit and loss" or something, that I really don't want to count; if we're looking at something else, we may not take them out. So this is not a rule that you need to take tables out; it depends.
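A sketch of that alphabetic-versus-numeric screen in Python (the 20% cutoff is the rough number mentioned above; everything else here, names included, is illustrative):

```python
import re

TABLE_RE = re.compile(r"<TABLE>(.*?)</TABLE>", re.IGNORECASE | re.DOTALL)

def strip_numeric_tables(text, alpha_cutoff=0.20):
    """Remove <TABLE>...</TABLE> blocks only when their content is mostly
    numeric; blocks that are mostly letters are kept, since some filers
    wrap ordinary paragraphs in table tags."""
    def judge(match):
        inner = match.group(1)
        alpha = sum(c.isalpha() for c in inner)
        digits = sum(c.isdigit() for c in inner)
        total = alpha + digits
        if total and alpha / total < alpha_cutoff:
            return " "            # a real (numeric) table: drop it
        return match.group(0)     # prose disguised as a table: keep it
    return TABLE_RE.sub(judge, text)
```

Tuning `alpha_cutoff` is the judgment call described above: too high and you delete prose, too low and numeric tables leak into your word counts.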
Up to this point I've kept the HTML markup in, because I still needed it to identify tables and whatnot; now I just want to take away all the HTML. If you didn't, "font" would probably be the most frequent word occurring in all documents, so if you're analyzing the text, this really is important: take out all the markup. I also usually go through and handle the entity encoding, because it's an ASCII file. For example, the ampersand sign is always encoded as "&amp;", a non-breaking space as "&nbsp;", and so on, so you go through and take out all that noise. It's not critical, but I usually do it.
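In Python, the standard library's html module handles the entity decoding; the tag stripping shown here is the same crude regex idea, a sketch rather than production-grade HTML parsing:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_markup(text):
    """Remove HTML tags, then decode entities such as &amp; and &nbsp;."""
    text = TAG_RE.sub(" ", text)   # crude tag removal, fine for word counts
    return html.unescape(text)     # "&amp;" -> "&", "&nbsp;" -> space, etc.
```

Replacing tags with a space rather than the empty string keeps adjacent words from fusing together, which matters once you start counting tokens.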

So, we're counting words in this example. What I first want to do is go through the document and create a vector of things that look like words, because I don't know yet whether they're words. What are things that look like words? Again I use a regular expression. I have an indicator here that says ignore case, so you'll notice I'm only specifying one case but telling it case doesn't matter. Then a word boundary: within regular expressions, a word boundary may be a space, it may be a line feed, it can be a lot of things. So: a word boundary, followed by a hyphen or A through Z, and I want to see two or more of these. We somewhat arbitrarily decided we won't count "a" and "I" as words; if you do, in 10-K documents you're more likely to stumble over headings and count those as words, and they're just not important words. So our words are two or more characters followed by a word boundary. I feed the document through this regex and, boom, I've got a vector of tokens, things that look like words. Then you need to take those and work through a dictionary. Another feature you need in the language you're working with is something like hash tables or dictionaries, so that as I pull in a token, I don't have to go through 80,000 words asking whether this one matches; I can just look it up. So: take the token, look it up, see if it's a word, and if it's a word, count how many times it occurs. We've got a master dictionary, we go through it and tabulate the word counts, and if you're focusing on sentiment, the sentiment words are a subset of those, on a particular sentiment list, which we'll talk about in a second.
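Putting the tokenizer and the hash-table lookup together in Python (a sketch with a toy word list; the real master dictionary and sentiment lists are of course far larger):

```python
import re
from collections import Counter

# Things that look like words: two or more letters (hyphens allowed after
# the first letter), bounded by word boundaries, case ignored.
TOKEN_RE = re.compile(r"\b[a-z][a-z-]+\b", re.IGNORECASE)

def tokenize(text):
    return TOKEN_RE.findall(text)

def count_listed_words(tokens, word_list):
    """word_list is a set, so each token is a constant-time hash lookup,
    not a linear scan through 80,000 entries."""
    return Counter(t.upper() for t in tokens if t.upper() in word_list)
```

Dividing the totals from `count_listed_words` by `len(tokens)` gives the proportion-of-sentiment-words measures discussed next.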
account for those yeah the total number of words so you have the proportion of positive words a negative words those kinds of things question again that wasn’t working first of all in our dictionary we don’t include proper nouns and geographical locations those types of things we don’t include abbreviations and there’s just sometimes there’s just some garbage in terms of characters that may be defining the table in the old days or something so you run into just some garbage in the document okay so this just you’ve got a count of anyone you just got to see is this a woman your your dictionaries 185-thousand it’s around baby something I don’t know so and that’s what the PowerPoint that’s not a word and I’ll look it up over the end of the legal word but for example we don’t we don’t do a lot of chemical terms because you know that’s going to become essentially a dummy variable indicating this is you know it’s an industry dummy variable so we don’t do a lot of chemical we don’t do a lot of specialty well I have dictionaries later but you can go on the internet and you can find dictionaries that range from about 30 in English dictionary the range of about 30 and I’ve discussed this with some of my colleagues in terms of doing this in other languages and we debated whether it’s more or less challenging so you can find like dictionaries with and technically you know if you’re in this profession though those really like dictionaries they’re just worthless technically you’re right you’ll find word list from like 30,000 to over a million and so you’ve got to decide you know my plan of Scrabble here or am i what am i interested in and we for our recent deciding you know we don’t want proper nouns when I didn’t mess with that we started with a reasonable list of generic words you know common words and then I went through all the 10 Ches and looked at tokens that weren’t in the dictionary and look for ones that occurred most frequently to add words that didn’t happen to be in 
this dictionary that we pull off there are lots of great word lists anybody know whatever lots of great word lists now on the internet because he was trying to hack passwords or Damini wordless unique dictionaries maybe lookups so you know so I can go through and just go through this dictionary I’m trying to break into your account Oh word let’s play a very important role so there’s a lot of stuff out there not because finance researchers are interested in I don’t know how important this is but it’s it’s an issue and you see some people in one direction we chose to go another I’m not saying we chose the right path but I have an opinion on so we creating work less so
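The pipeline just described — strip the markup, decode the entities, pull tokens out with a regex, then tabulate them against a dictionary — can be sketched in a few lines of Python. This is a toy illustration, not our production code: the master dictionary and negative list here are tiny made-up stand-ins for the real lists of tens of thousands of words.

```python
import re
from collections import Counter
from html import unescape

# Hypothetical tiny master dictionary and negative-word list, for illustration only.
MASTER = {"the", "company", "reported", "loss", "losses", "impairment",
          "revenue", "expenses", "increased"}
NEGATIVE = {"loss", "losses", "impairment"}

def clean_filing(raw: str) -> str:
    """Strip HTML tags and decode entities like &amp; and &nbsp;."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop markup so "html" isn't the top word
    return unescape(text)                # &amp; -> &, &nbsp; -> space, etc.

def tokenize(text: str) -> list:
    """Things that look like words: two or more letters between word boundaries."""
    return re.findall(r"\b[a-z]{2,}\b", text, flags=re.IGNORECASE)

def tabulate(tokens):
    """Count only tokens that are in the master dictionary (a hash-table lookup)."""
    counts = Counter(t.lower() for t in tokens if t.lower() in MASTER)
    total = sum(counts.values())
    neg = sum(n for w, n in counts.items() if w in NEGATIVE)
    return total, neg

raw = "<p>The company reported a loss of $5&nbsp;million; impairment losses increased.</p>"
total, neg = tabulate(tokenize(clean_filing(raw)))
print(total, neg, neg / total)  # dictionary words, negative words, proportion
```

Single-letter tokens like the "a" above are dropped by the regex itself, and words outside the dictionary ("million") never enter the counts — both choices discussed in the talk.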

We're going to create a word list — the words we think are important, the words we're going to count. Should we list all the words, the lexemes, or the stems? Stemming means reducing words to a common root; the alternative, on the next line, is to expand every root word to include its inflections, and count the inflections. So what's stemming? There are plenty of good stemming algorithms out there — you can pull down code that does a pretty good job of this; you don't have to invent it. If I'm counting words like "expense," stemming just lumps them all together: "expensing," "expenses," "expensed" all count as the same thing, and I might even chop the ending off and count everything under the stem "expens-." Everything that looks like that gets tallied as that one token. That's one approach — stemming, where you take all your words and try to reduce them to a common root. The alternative, which is what we do, is to go ahead and inflect everything: expand the dictionary to include all the possible inflections. Why? It's back to: I've got to know what's going on in every bit of this; I'd better know exactly what choices are being made. All those "expense" forms are an example where stemming works, but certainly in our profession you'd understand why you wouldn't always want to take "odds" and parse it down to "odd" — "odds" and "odd," at least in our business, aren't the same thing. So stemming doesn't always work, and the textual-processing literature shows that stemming does not, in general, improve performance. Why? It doesn't work well for a language that handles morphology as inconsistently as English — the way we form past tenses and all the various inflections is just not very consistent. Some languages are: supposedly German is much easier because it's much more structured. If you have a very well-structured language, stemming may be a great idea. We just don't think so here — and that's subject to debate.

OK, next I'd like to talk a little about word lists, because we've put some together. Why did we put them together? Because historically people have looked at sentiment and those types of things, and if there isn't a good word list, why mislead users? It's back to: don't use black boxes. So what did we do to create a word list? First we created our master dictionary — I already talked about that. We actually started with existing word lists and added words that are in the 10-Ks, because we want to make sure we have business words: "accretive," for example, I don't think was in our original source list, but it's certainly one that shows up a lot in 10-K-type documents, so we added it. So we created a dictionary. Then how do you do it? It's just hard work. You go through that list, and it's subjective. We were caught in a situation where we thought some of the other word lists being used weren't very good, and I kept struggling with Tim over this: how do we go out and objectively find a word list that works better? And there's nothing out there — it has to be developed in the context of what you're doing. So finally I said: we're just going to have to create our own. It's subjective. We went through all the words used in more than five percent of the documents — plus their inflections — and classified those. Why not go farther down the list? Well, it takes a lot more time, and also, once you're down to rarely used words, they become indicator variables for specific companies: if the word "elephant" is only used by one company and you count it as an important word, there are ways of creating problems in all this. So we try to avoid rare words and just classify the ones we actually see. And what's our very objective standard? If I'm reading a financial document — say I'm looking for negative words — and I see this word, am I more likely than not to be concerned? That's about as fuzzy as it gets, I know, if you come from a quantitative background, but that's the standard we used, and we put a lot of time into building the lists. And then, when we were going through the paper we published in the Journal of Finance, Campbell Harvey — who has his own finance glossary, so I think he has an interest in all this — spent a lot of time going back and forth with us, along with an associate editor and a reviewer: let's think about these words, let's think about those words. So a number of eyeballs went into putting together the lists we created.
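The inflect-everything choice can be illustrated with a toy inflection table. This is a sketch under made-up lists — the point is only that every counted form is an exact word chosen in advance, rather than the output of a stemmer:

```python
from collections import Counter

# Instead of stemming tokens down to a root, expand each dictionary entry to its
# inflections ahead of time, so every counted form is an exact, known word.
# (Toy table for illustration; the real dictionary enumerates these by hand.)
INFLECTIONS = {
    "expense": {"expense", "expenses", "expensed", "expensing"},
    "odd": {"odd"},    # "odds" is deliberately NOT an inflection of "odd":
    "odds": {"odds"},  # in business text they mean different things, which is
}                      # exactly why blind stemming can mislead

# Map every inflected form back to the entry it counts toward.
FORM_TO_ENTRY = {form: entry
                 for entry, forms in INFLECTIONS.items()
                 for form in forms}

def count_entries(tokens):
    """Tally tokens under their dictionary entry via exact-form lookup."""
    counts = Counter()
    for t in tokens:
        entry = FORM_TO_ENTRY.get(t.lower())
        if entry:
            counts[entry] += 1
    return counts

c = count_entries(["Expenses", "expensed", "odds", "odd", "expense"])
print(c)
```

A stemmer would have merged "odds" and "odd" here; the expanded table keeps them apart because we decided, by hand, that they should be.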

So what are examples of negative words? "Loss," "bankruptcy," "felony," "litigation" — if I see "felony" in a filing, I'm going to read that sentence again; that concerns me. How many of those are there? A couple of thousand. We did all this initially for the Journal of Finance paper — the word lists were built on filings through 2007 or so — and unfortunately for us, the language, and the usage of language, in these types of filings changes, so I've since updated the lists through 2011.

Positive words? There are far fewer — "beneficial," "excellent" — far fewer words that are unilaterally positive. And there's another important issue here. You don't have much of a problem with people using negative words in positive contexts — you don't see firms reporting "these are not terrible earnings," right? But you will see people in publications compute something like net tone — positive minus negative over positive plus negative, something like that — and prima facie, at first glance, it makes a lot of sense. But if you actually go through and read the stuff — and Tim is the one who does most of this — if you actually read 10-K filings, what you'll find is that many, many times, when someone has to say something negative, they wrap it in positives. Think about it: if I'm writing a letter to fire Tim, my draft probably has three paragraphs saying Tim, you've been a great colleague, you show up for work, you eat with silverware — all these superlatives, all these positives — and then somewhere near the end it says: unfortunately, we are going to have to dismiss you. Net all that out — hundreds of positives, three negatives — and this was a really positive document. Writers tend to frame negative statements with positive words, so we don't put much stock in positive word lists, though you see people using them; we're reluctant.

[Question] You have a negative word list of two thousand-odd words — why not just pick the top ten? OK, so this is something some people have done: why not run what is essentially a massive regression of filing-day returns against the counts in your dictionary, and just identify the words that really have a lot of kick? Maybe there are 25 words that carry all of it — why not use those and dodge the whole problem? Here's why: suppose we discover that if you use the word "aardvark," you're going to lose money next quarter. What do the people writing these documents do? They stop using that word. And — occasionally Tim makes a good point — there's another reason: go try to write around two thousand-odd negative words. Good luck, right? You could say we should just list the really important words; no, we've tried to make the list exhaustive, so that if you're trying to write around us, it's harder. There aren't a few target words you can simply avoid.

Beyond positive and negative, which are the obvious ones, there are uncertainty words, which are not just risk words. We do include words like "risk," but it's not "risk," "standard deviation," "variance"; it's also "ambiguity," "approximate" — more uncertainty words than purely risk words, although the risk words are in there. Also occasionally of interest: these documents contain a lot of legal-sounding words, the litigious words. We started with about 50 of those when we first did this, and then, as you go through all these tokens you're unsure about, you keep finding that the lawyers are using a lot of words, and if you spend a little time you can factor them out. These are actually pretty simplistic, pretty short lists, but in the work we've done they're pretty useful. And then the modal words: when you use things like "always," "best," "definitely," "will," "must" — we will do something — that's a pretty strong signal. And then there are the weak modals: "could," "depending," "may," "possibly."
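Once the category lists exist, turning a document into tone measures is just counting and dividing. A toy sketch — the category lists below are tiny, made-up subsets of the real ones, and I scale by the raw token count here, though one could scale by dictionary-word count instead:

```python
from collections import Counter

# Illustrative category lists; the real ones run to hundreds or thousands of words.
CATEGORIES = {
    "negative":     {"loss", "bankruptcy", "felony", "litigation"},
    "uncertainty":  {"approximate", "ambiguity", "risk"},
    "strong_modal": {"always", "best", "definitely", "will", "must"},
    "weak_modal":   {"could", "depending", "may", "possibly"},
}

def tone_measures(tokens):
    """Proportion of words in each category, relative to total word count."""
    words = [t.lower() for t in tokens]
    total = len(words)
    counts = Counter(words)
    return {cat: sum(counts[w] for w in vocab) / total
            for cat, vocab in CATEGORIES.items()}

tokens = ("we may face litigation and could record a loss "
          "depending on the outcome").split()
m = tone_measures(tokens)
print(m["negative"], m["weak_modal"])
```

Note how the weak modals ("may," "could," "depending") dominate even this one weaselly sentence — exactly the signal discussed above.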

Weak modals are weasel words, and that's usually signaling something. When I present this I like to point out that strong modal language is the language of parents: you will do your homework, you must be on time — I have two teenagers. Weak modal language is the language of weasels. Seriously — and a weak-modal measure is often a really strong signal in a 10-K: they're putting in all these weasel words, and that should tell you something.

This next quote is from a very old paper, but I think it gets at something important — it's back to don't use off-the-shelf technology, don't use black boxes: "Content analysis stands or falls by its categories. Particular studies have been productive to the extent that the categories were clearly formulated and well adapted to the problem." So you really need to develop word lists in the context of what you're doing. We hope the lists we put together work well in the context of financial documents — but suppose you say, OK, I buy that, and then you go off to do this Twitter stuff that's so fashionable; you're going to do tweets. Well, "sucks" isn't in our financial dictionary, and then there's all the irony that's usual in tweets — you have to spend a lot of time sorting out when negative is negative and when negative is positive. The point is, you need to develop a dictionary in that context; if you pull ours off the shelf and use it there, you're going to screw up.

Why is it so important to keep track of what's going on under the hood? There's something in natural language processing called Zipf's law. It's not really a law, but if you're familiar with power distributions — it's interesting how many things turn out to look like power distributions — the example I use is that the distribution of words is like the distribution of market cap in finance. Sometimes it's hard to convey to students that there are just a few really big firms, a lot of small firms, and the few big firms drive everything. It's even more true with words. Some people ask, why don't you take out stop words? We don't — I've posted a list of stop words if you want one; I just don't think it matters — but clearly "the" dominates: out of roughly ten billion words in all these filings from 1994 to 2011, "the" accounts for about eight percent. The next one, around five percent, is "of"; the next, over four percent, I believe is "and"; and then it drops like a rock. Take any reasonable subset of words and you'll see the same pattern. What that tells you — and here I'll throw up the table from our paper, because this is the essence of our paper — is that if you report a result with a particular word list, a reviewer like Tim is going to say: show me the top 25 words; what's driving the result? Because those top 10 or 25 words will account for a huge percentage of the total count for a particular sentiment. And sometimes they're goofy words.

That's the punchline of our Journal of Finance paper. I've learned from working with Tim that if I can produce a table of numbers that makes him laugh, we're doing well. The people doing this in finance had been using the Harvard psychosociological dictionary — a sentiment dictionary developed in the 1960s at Harvard in sociology and psychology — and folks doing financial textual analysis used that word list for quite some time. So let me lift up the hood and show you what's producing the result when you say these are negative words. When you're reading a 10-K and you see the word "tax," do you get upset? Is that unusual? Does it scare you? No.
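The "show me the top 25 words" request is easy to automate: rank the hits and report cumulative shares. A sketch — the tallies in the Counter below are hypothetical data, not numbers from any paper:

```python
from collections import Counter

def top_contributors(counts: Counter, k=5):
    """Which words drive a sentiment count, and what cumulative share do they
    account for? This is the table a reviewer will ask for."""
    total = sum(counts.values())
    rows, cum = [], 0
    for word, n in counts.most_common(k):
        cum += n
        rows.append((word, n, 100.0 * cum / total))
    return rows

# Hypothetical tallies of negative-word hits across a set of filings.
neg_counts = Counter({"loss": 400, "losses": 250, "impairment": 150,
                      "adverse": 100, "litigation": 60, "felony": 25, "default": 15})
rows = top_contributors(neg_counts)
for word, n, cum_pct in rows:
    print(f"{word:12s} {n:5d} {cum_pct:6.1f}%")
```

With a Zipf-like distribution, a handful of words swallow most of the count — in this toy data the top five already cover 96 percent — which is exactly why you must eyeball them.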
The ones that are checked here are the ones we would also include as negative: "loss," "losses," "impairment" — yes, those concern me; "against," "adverse" — we take those. But "cost"? A lot of firms have costs; I'm not sure that concerns me. And again, because we're taking the word list from another discipline — I'm not sure why, but in the Harvard list "board" is a negative word. I don't think that one spills over to our setting. The next one took Tim about 30 seconds to figure out — and he's been at this a lot longer than that. I don't want to reveal too much right here, but I'll tell you.

Was it a coding error? No — he was just thinking: why would somebody use the word "vice" in their 10-K? "Vice" is a negative word in the Harvard list. I said: Tim — vice president. It goes on the slide, and, not surprisingly, it shows up a lot in these documents, and you'd be counting all of those as negative. So if you're running an off-the-shelf word list and generating your results and you never really look at what's driving those results — look at what is driving those results. [Audience suggests another word] Oh yeah, that's one of my favorites too — thus the title of our paper, "When Is a Liability Not a Liability?", because "liability" and "liabilities" get counted as negative. And notice the cumulative percentages: if I go down maybe 25 words on that list — and how many does the Harvard negative list have? With inflections it's around 4,000 words — out of those 4,000 words, when you tabulate them, something like half of the total count is attributable to just that handful. So again, if you're doing this: look at what's driving your results.

[Tim] Why don't you spend a minute on the word "mine"? — Yes, interestingly, some of the words we didn't list here are clear industry flags. "Mine" is a negative word in the Harvard dictionary — if your children grab a toy and scream "mine," that's not a good thing — but there are certain industries where "mine" is used constantly, so once again you're going to count their documents as extraordinarily negative, and if you don't check what you're looking at, that's a problem. Same with "gold" — "gold" is positive in the Harvard list. "Crude" is also a negative word. Why? Crude oil — there you go. And once again, in some industries it's huge.
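One simple diagnostic for these industry-flag words is to check how concentrated a word's usage is across firms: a word whose count comes almost entirely from one filer is behaving like an industry or firm dummy. A sketch with hypothetical per-firm counts:

```python
from collections import defaultdict

def usage_concentration(word_counts_by_firm):
    """For each word, the share of its total count coming from the single
    heaviest-using firm. Values near 1.0 suggest an industry/firm flag."""
    totals = defaultdict(int)
    top = defaultdict(int)
    for firm, counts in word_counts_by_firm.items():
        for word, n in counts.items():
            totals[word] += n
            top[word] = max(top[word], n)
    return {w: top[w] / totals[w] for w in totals}

# Hypothetical data: "mine" is concentrated in one mining company,
# while "loss" is spread across everyone.
data = {
    "MiningCo": {"mine": 90, "loss": 10},
    "RetailCo": {"mine": 5,  "loss": 12},
    "BankCo":   {"mine": 5,  "loss": 8},
}
conc = usage_concentration(data)
print(conc["mine"], conc["loss"])
```

Here "mine" scores 0.9 (one firm supplies nearly all of it) while "loss" scores 0.4 — a quick way to spot Harvard-list words that are really just flags for miners, oil producers, and the like.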
OK, I think we're about done — so, what resources are available? I'll share just about anything I've got with you. I have all the files downloaded and pre-parsed, so if you're interested in all the 10-Ks as text — I don't know, it's maybe 50 or 100 gigabytes; I actually once sent it to someone in Europe on five thumb drives, before we had access to Box. We've pre-parsed all these files to strip out the markup so you can focus on just the text. And on my website I've got a bunch of material you may or may not find useful. All the sentiment dictionaries are listed there — updated through 2011 — and we can also give you the versions we used for the Journal of Finance paper. Something we find very useful in everything we do is having a master dictionary: what do we count as words, and what don't we? I have it set up as a spreadsheet — an Excel file — and for every word it gives the frequency with which that word occurred across all the 10-Ks (I think it was just the 10-Ks I ran through the counter), so you can see proportionally how frequently these words are used in financial filings. It also gives the sentiment classifications — essentially a 0/1 variable (actually it includes the year) telling you, say, that a word is on the negative list. So there's a spreadsheet with a lot of master-dictionary material that you might find useful.

And as I mentioned before: in political science and some other fields it's considered useful to take out stop words. I don't find that particularly useful — I know "the," "of," and "and" are going to show up a lot, and I can account for that, so I don't bother taking them out. If you'd like to, I have some fairly good lists out there; the problem is that what counts as a stop word is a matter of opinion, so I'd just as soon not get into that.

I've also posted a spreadsheet that has every 10-K and 10-K405 and so on — each line represents one filing — with counts for all the different sentiment categories

that we have. It also has the total number of characters — the gross number, before I stripped anything out, and the net number — plus the number of XBRL characters and the number of ASCII-encoded characters. So if you're in accounting and thought it was interesting to ask, say, whether better-structured documents create bigger earnings surprises, you could take this spreadsheet and do that study in about thirty minutes. This file is about a hundred megabytes, so you can download it really quickly. I have other resources that take up more space, but I'd be glad to share them. Tim, anything to add? And we're certainly open for questions.

[Tim] I like the typos. Every once in a while Bill will look at the output and freak out: "depreciation" is used thirteen times in this entire document? — and it turns out it's a major error on the filer's part, a misspelling of "depreciation." And we got one of these into print, with Arthur Andersen. A filer misspelled the auditor's name — actually it's worse than that: it was misspelled sixty-nine times across the audit letters of firms Andersen had signed off on. We had a paper where we wanted to control for the auditor, so Bill went through and matched the names, and then he gave me the documents and I said, you've got an error right here — there's no way Arthur Andersen did so few audits. And he said, I didn't get it wrong. And then I pulled the filings up, and sure enough, they had misspelled "Andersen" as "Anderson." So there's a lot of that, and it is kind of funny when you look at it.

[Inaudible exchange with the audience.]

As you probably well know, nowadays the web appendix is where you put everything, so we usually try to write down in a web appendix a recipe for what we've done. Even then it's hard to replicate, because exactly which regular expression you used, exactly how you handled things, can have a big impact on the outcome. This is a big fuzzy part of our profession, and a lot of the time what people do is just follow whatever someone else did, which is laziness.

[Question] Do you come across a lot of new words? Yes — I wish I could remember off the top of my head what they are, but as I said, we did this originally around 2007, so I went through the more recent filings looking for things that look like words but aren't in our dictionary, and I added — I forget how many — thirty or forty. The only proper noun we include is "Scholes": people are interested in looking at Black-Scholes and the like, and "Black" is a word but "Scholes" is not, and it occurs a lot in these documents, so by fiat we decided it's a word. But we're not parsing at that level. It's back to this: the example I walked through is where I'd be most confident. Reading papers that do this, I'm most comfortable with results from that kind of simple counting. If you start telling me you're diagramming sentences, that you're building artificial intelligence — we're really not very good at that. Just look at the people who do it for a living.
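Going back to the typo story for a moment: one way those misspellings surface is a near-miss check — tokens missing from the dictionary that sit very close to a real word are probably typos worth eyeballing. A sketch using Python's difflib; the tiny dictionary here is illustrative:

```python
import difflib

# Illustrative dictionary; the real one has tens of thousands of entries.
DICTIONARY = {"depreciation", "amortization", "impairment", "revenue"}

def flag_near_misses(tokens, cutoff=0.85):
    """Tokens not in the dictionary that closely resemble a dictionary word
    (likely misspellings, e.g. 'depriciation' for 'depreciation')."""
    flags = {}
    for t in set(t.lower() for t in tokens):
        if t in DICTIONARY:
            continue
        close = difflib.get_close_matches(t, DICTIONARY, n=1, cutoff=cutoff)
        if close:
            flags[t] = close[0]
    return flags

tokens = ["depriciation", "revenue", "impairment", "widget"]
flags = flag_near_misses(tokens)
print(flags)
```

"widget" is simply not a dictionary word and is ignored, but "depriciation" is flagged as a probable misspelling of "depreciation" — the kind of thing that, repeated thirteen times, tells you something about the filer.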

The people who do it for a living aren't very good at it, and they spend a lot of time on it. So the deeper you try to parse, the more I'm sitting there thinking your results are driven by a mistake. The simpler, the better — I think the trick to coming up with great applications of this technology is to come up with simple tricks. Tim and I did a paper on business ethics where we just looked for "ethics," "ethical," "ethically," "corporate responsibility" — there are something like six terms we looked for — and I'm very confident in those counts. But if I start telling you I'm capturing the nuance of words, don't believe me.

The other thing we did in that paper: we could identify the code of ethics separately from the rest of the document. Do you think people copy their codes of ethics? Yes — they plagiarize the heck out of them; they lift somebody else's. And the trick is, you don't know who copied whom. Bill wrote a program that basically took a snapshot — a photograph — of each sentence and compared them; we compared billions and billions of sentences. Firms would copy the whole thing except they'd change the name of the firm. Your lawyer would say that's good practice, right? This is a document that has survived in court, so you copy it — everybody in HR copies everybody else's HR material; that's what HR people tell me. And the NYSE, trying to help you out, put out an example of what a code of ethics should look like, and you see that language appear all the time. It's a really strange presence: this is supposed to be very important, but we're not going to write it ourselves. People don't copy word for word, but if you lean on the legal excuse for copying, you really show up when you compare sentences, because some of them are pure boilerplate — "ethics is at the core of what we do and of our beliefs and values," that kind of thing — and you read a sentence like that and go, wait a second, this showed up in ten other documents; for one sentence the count was over a hundred. So if there isn't academic merit in all this, there's certainly some humor.

Any other questions? Thank you for coming — we appreciate the opportunity to talk to the group, and we appreciate Jane.
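For the curious, the sentence-snapshot comparison described above might be sketched like this: normalize each sentence, hash it, and look for fingerprints shared across documents. The documents and sentences below are invented for illustration:

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(sentence: str) -> str:
    """Normalize a sentence and hash it -- the 'snapshot' used for comparison."""
    norm = " ".join(re.findall(r"[a-z]+", sentence.lower()))
    return hashlib.md5(norm.encode()).hexdigest()

def shared_sentences(docs: dict):
    """Map each sentence fingerprint to the set of documents containing it,
    keeping only fingerprints that appear in more than one document."""
    seen = defaultdict(set)
    for name, text in docs.items():
        for sent in re.split(r"[.!?]", text):
            if sent.strip():
                seen[fingerprint(sent)].add(name)
    return {fp: names for fp, names in seen.items() if len(names) > 1}

# Hypothetical codes of ethics: two firms share a boilerplate sentence verbatim.
docs = {
    "FirmA": "Ethics is at the core of what we do. We value honesty.",
    "FirmB": "Ethics is at the core of what we do. Our people come first.",
    "FirmC": "We comply with all applicable laws.",
}
dupes = shared_sentences(docs)
print(len(dupes))
```

Because firms typically change only the company name when copying, exact-sentence hashing of the remaining boilerplate is enough to surface the shared text at scale.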