Search at Tumblr by Yufei Pan

Welcome, I'm excited to see so many of you could come. I hope you all enjoyed the previous session. Before I start, I just want to quickly introduce Tumblr itself. Tumblr was founded by David Karp in 2007, and over the years it has gained huge popularity as a publishing platform. As of today we have about 160 million blogs, which is a huge number, and in terms of posts we have around 70 billion, and the content is really, really great. We are also a social network where people follow the blogs they like. I took a screenshot of my dashboard so that you can see what it looks like. By the way, a quick question: how many of you are Tumblr users?

With such massive content, search becomes a very fun and very effective way to find content. Every day we are seeing about 50 billion [figure unclear] queries from our users. But despite this, for a long time the search functionality at Tumblr was very limited. From 2007 until 2012, in fact until 2013, pretty much the only search available at Tumblr was the tag page. Essentially it was a table with tag ID, post ID, and timestamp: given a tag, return the posts. When a user typed a query, everything was converted into one single tag ID, and the posts were returned in reverse chronological order. So it was very, very basic. It works okay when you search for one single tag, but anything more than that becomes almost impossible, especially with such massive content. There was also no easy way to find blogs. We do have a human-curated directory called Spotlight, so people can navigate this directory to find blogs they are interested in, but again that covers just a tiny fraction of all the blogs we have.

So in 2012 the leadership team at Tumblr finally decided to have a dedicated team take on search and discovery, because it had become so much more important with the fast growth of users and content. In July 2012 Zach, who is also here, became the first search engineer to join Tumblr. Then I followed, and then we expanded to include Patrick and Adam, among others. Together with the design team and the front-end team we were able to build a set of exciting new search features from the ground up in 2013, including post search, blog search, theme search, personalized recommendation, and a bunch of other things.

In the next couple of slides I will quickly walk you through the features available in search at Tumblr. Then I will talk about the underlying architecture, the frameworks, and some of the implementation details.

First and foremost, the whole new search main page we launched last year really brought our users' ability to search to a totally new level. Not only does it allow you to do post search, it also gives you a very decent capability to search for blogs, and it offers related terms you can search for. For post search there is no longer a limit of a single tag: it is full-text search. It ranks results based on a combination of recency, popularity, and text relevance, and it allows you to filter results by post type. Post types are a fairly unique Tumblr feature: we have seven predefined post types.

Post types are very popular among users, so search allows filtering by them too.

Now blog search. Finding blogs is one of the top priorities for Tumblr because, as I mentioned on the previous slides, we have this concept of the dashboard, which is similar to Facebook's News Feed and Twitter's timeline. Essentially all the posts from the blogs you follow appear in the dashboard. Our users spend more than 50% of their time in the dashboard, consuming the content they are interested in. So to a very large extent the experience of a Tumblr user is decided by the dashboard, and the dashboard is decided by who they follow. Helping users find great blogs to follow is therefore very, very important. With blog search, not only can you search by name or title: we also look at the posts published by a blog and at the engagement on the posts related to your query, and from that we decide which blogs are really the top blogs for this particular query. We return those to our users, and that has driven a lot of follows. For a given blog we also try to compute its best posts, so users can get a very quick preview.

In addition to the new search we also built autocomplete, which we feel is not only a convenient piece of functionality but also a great opportunity for Tumblr to show users interactively what we have. A big problem for us is really helping users discover great content, because we have such massive content. So when people type, we give them this interactive autocomplete, and I think it is a great way to get people more engaged. It is used not only in search but also when people create a new post: mentions are autocompleted, and for tags we try to suggest relevant tags as they type.

Search also powers personalized recommendation, again for the same priority: we want people to follow really great blogs. So we help our users by looking at their social graph. We look at who they follow, and then at what kinds of blogs those people like, and come up with better recommendations. We also use a bit of the user's activity history to help them find better blogs to follow. We do the same thing for the digest: we try to pick the best posts in the dashboard on a weekly basis and email them to users, which helps get people more engaged.

Also in 2013, and this was very exciting, in addition to all the search work we did content discovery. This is another effort to find out what the interesting content at Tumblr is so we can push it to our users, so they get more engaged and use Tumblr more. We do trending tags and trending blogs, and as you can see we create nice charts so that our editors and sales team can understand what is trending right now and, for a given trending tag, what its historical trend looks like.

The last search we built was theme search. Themes are one very unique element of Tumblr. We allow users to customize their blogs, and people can create themes and even sell themes to other people, so it is kind of the commercial part of Tumblr. For a long time there was no way to search them. Again it was a directory, and you had to walk through different levels of the directory, but now we have made that searchable too.

Okay, so those are the search features we built in 2013. Now I want to go a little bit into the engineering side and talk about how we built this, especially since within a year, with a team of six, we were able to build a pretty big set of features. One key thing we learned, looking back, is that at the very beginning we built two frameworks which generalize a lot of the common functions of the different services, so that we could apply these common frameworks to different services, which reduces development time.

There is a lot on this slide, but if you look at the left side you can see there are essentially three layers.

The first layer is the search online layer. This is the layer that interacts with user requests. When a user sends a query to one of our services, either the post search service or typeahead, the online layer takes this request. All the services are built on top of the search online framework, which handles a lot of the common parts.

The online layer then queries the data layer. The data layer contains all the key data the services are built upon. For example, there are the indexes for post search and for blog search. Another type of data is what we call signals, which are mainly used for ranking. All the services have their own ranking formula and their own set of signals, which are also served from this data layer. Another type of data in this layer is what we call precomputed search results. Because Tumblr has very high query traffic and lots of data, we do a lot of precomputation using our Hadoop cluster to pregenerate that data so that search can be really fast.

The third layer is what we call search offline, because it is pretty much independent of the user query, so we can take a little more time to do the computation. For offline there are two major kinds of input data. One is a more real-time stream we call the firehose, similar to Twitter's stream. The other is batch data: a lot comes through Sqoop, which is kind of a copy of our MySQL database, and we also have a lot of Scribe logs. Those Scribe logs are more about user activity, and they come from our PHP app and contain a lot of valuable information. For these two kinds of data we again have a framework, which helps us build a lot of different jobs to analyze and process the data and generate all the different pieces of data you see here: the indexes, signals, and precomputed search results.

On the next slide I will give a quick summary of the software stack we are using.

In the search online layer we mainly use PHP plus HAProxy and nginx, and memcache is used everywhere for caching. We use Icinga for monitoring and Scribe to collect our logs.

Yes? [Answering an audience question on Redis versus memcache:] Essentially there is not a big difference, except memcache does not have persistence. For Redis, we not only use it as a cache: we have a great in-house piece of software [name unclear], so essentially we use Redis more like a distributed data-serving layer. It is very scalable, it is very fast, and it has persistence. That is the difference.

We use OpenTSDB to collect all the metrics so that we can easily see how things are performing.

In the data layer, Solr is mostly used for serving the inverted index data. Redis is a great way, with its persistence, to serve signals and precomputed data. MySQL is where we store the main documents: all the meta information about posts and blogs is stored in MySQL.

For search offline, we have a pretty big Hadoop cluster and we run different jobs written in Hive, Pig, Scalding, and Python, the last mainly for streaming. We also have Java programs that interact with Solr to do indexing for the more real-time search. So that is the stack used for Tumblr search.

The search online framework is really the key piece for search online. First of all, it generalizes the common search flow into about six stages: the query parsing stage, the retrieval stage, signal fetching, then ranking, then fetching the documents, and at the end filtering. Pretty much all the services follow this pattern.

So we have one implementation, which we call SearchBase, that implements all the common boilerplate code. For example: how do you execute the whole flow, and how do you pass results from one stage to the next? How do you do caching? We have multi-level caching, and the application can configure how it wants caching done. How do you log the different pieces of data and profile performance? You can also choose to execute a search in either synchronous or asynchronous mode. Then we have an editorial capability, so that you can either block some of the results or inject some results. All of this is common to every search service you build on top of the framework.

The only things each service needs to worry about are really how to parse the query; based on the query, how to retrieve a set of documents from a much larger space, which we call candidate documents; what signals to attach to each document; and the ranking, which is very specific to each service. For that you write a class.

By the way, this online framework is written in PHP, for many reasons, mainly because we already had a lot of plumbing work in PHP. So we chose to do it in PHP, even though PHP has certain limitations: it does not really have true concurrency. For right now we are happy with that, though we might want to move this out of our PHP layer at some point.

For the offline side we built a similar batch-processing framework: we generalized a lot of building blocks so that jobs to process data are easy to construct.
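
To make the six-stage online flow concrete, here is a minimal sketch in Python. The real framework is written in PHP and its code is not shown in the talk, so everything below is an assumption for illustration: the method names, the `SearchContext` container, and the toy `ToyPostSearch` service with its in-memory index are all hypothetical. Only the stage order and the split between a shared flow and per-service hooks come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SearchContext:
    """State handed from one stage to the next (hypothetical container)."""
    raw_query: str
    terms: List[str] = field(default_factory=list)
    candidates: List[int] = field(default_factory=list)
    signals: Dict[int, Dict[str, float]] = field(default_factory=dict)
    ranked: List[int] = field(default_factory=list)
    documents: List[dict] = field(default_factory=list)


class SearchBase:
    """Owns the common flow; concrete services override the hooks."""

    def run(self, raw_query: str) -> List[dict]:
        ctx = SearchContext(raw_query=raw_query)
        ctx.terms = self.parse_query(ctx.raw_query)          # 1. query parsing
        ctx.candidates = self.retrieve(ctx.terms)            # 2. retrieval
        ctx.signals = self.fetch_signals(ctx.candidates)     # 3. signal fetching
        ctx.ranked = self.rank(ctx.candidates, ctx.signals)  # 4. ranking
        ctx.documents = self.fetch_documents(ctx.ranked)     # 5. document fetching
        return self.filter(ctx.documents)                    # 6. filtering

    # Shared defaults.
    def parse_query(self, raw_query: str) -> List[str]:
        return raw_query.lower().split()

    def fetch_documents(self, ids: List[int]) -> List[dict]:
        return [{"id": post_id} for post_id in ids]

    def filter(self, documents: List[dict]) -> List[dict]:
        return documents

    # Service-specific hooks.
    def retrieve(self, terms):
        raise NotImplementedError

    def fetch_signals(self, ids):
        raise NotImplementedError

    def rank(self, ids, signals):
        raise NotImplementedError


class ToyPostSearch(SearchBase):
    """Toy service backed by in-memory dictionaries (illustration only)."""
    INDEX = {"cats": [1, 2, 3], "gif": [2, 4]}
    LIKES = {1: 5, 2: 50, 3: 1, 4: 9}
    BLOCKED = {3}  # stands in for the editorial "block" capability

    def retrieve(self, terms):
        ids = set()
        for term in terms:
            ids.update(self.INDEX.get(term, []))
        return sorted(ids)

    def fetch_signals(self, ids):
        return {i: {"likes": self.LIKES.get(i, 0)} for i in ids}

    def rank(self, ids, signals):
        return sorted(ids, key=lambda i: signals[i]["likes"], reverse=True)

    def filter(self, documents):
        return [d for d in documents if d["id"] not in self.BLOCKED]
```

A service only fills in the hooks: `ToyPostSearch().run("Cats GIF")` walks all six stages and drops the editorially blocked post before returning.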
As you can see, we have four different types of jobs. Underneath, we have some very generic term generators, so we can quickly generate terms; there is also an indexer [partly unclear]; and we have a utility library with many classes that can be reused, written in Scalding.

Once all those jobs are written, they can be composed into a workflow with the workflow engine we built in-house. The workflow engine first resolves all the dependencies, because each job may depend on the results of upstream jobs. This is all expressed in a configuration file. The engine takes this configuration, constructs the workflow, and executes it.

It also provides some common functions. For example, it does versioning: whenever a new version of an output is generated, it can compute the difference from the previous version, which we call the delta. When we propagate this output to the data layer, only the delta needs to be propagated. Usually only about half a percent to 2 percent has changed and needs to be propagated, and that matters because the data is very big.

It is also very important that we have an integrity check, because we do not want our users to see incomplete results. So we spent a lot of time making sure we have good verification and integrity checking of the data, along with a lot of failure detection and alerts. That is all built into the workflow engine, so each individual job does not need to worry about it. This saves a lot of development time and also makes sure the quality of the data is good.

That was the frameworks and the architecture in general. In the next couple of slides I want to go into a little more detail.
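
As a rough illustration of two things the workflow engine is described as doing, resolving job dependencies from a configuration and propagating only the delta between output versions, here is a minimal Python sketch. It is not the in-house engine: the job names, the dictionary-based configuration, and the key/value output format are assumptions made for the example. Dependency resolution uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(job_config):
    """Return job names so that every job runs after its upstream jobs.

    job_config maps a job name to the list of jobs it depends on.
    """
    return list(TopologicalSorter(job_config).static_order())


def compute_delta(previous, current):
    """Compare two versions of a job's key/value output.

    Returns (upserts, deletes): entries that are new or changed, and
    keys that disappeared in the new version.
    """
    upserts = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    deletes = sorted(key for key in previous if key not in current)
    return upserts, deletes


def apply_delta(serving_copy, upserts, deletes):
    """Bring the data-layer copy up to date using only the delta."""
    for key in deletes:
        serving_copy.pop(key, None)
    serving_copy.update(upserts)
    return serving_copy


# Hypothetical workflow: each job lists the upstream jobs it needs.
JOBS = {
    "term_generator": [],
    "blog_index": ["term_generator"],
    "typeahead": ["term_generator", "blog_index"],
}
```

With this configuration, `execution_order(JOBS)` always places `term_generator` before `blog_index` and `blog_index` before `typeahead`. Shipping only `(upserts, deletes)` rather than the full output is what keeps propagation cheap when just a small fraction of the data changes between versions.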

On the implementation details, the first thing is indexing. Indexing is very important for search, and the first question we wanted to answer relates to Tumblr's particular environment: we have a lot of data. As I mentioned, we have a huge number of posts, 72 billion posts. So how do we index them? If we index all of them, does that really make sense?

After a lot of debate we did the math, and we realized first of all that we do not need to index all the posts, especially the reblogs, which are a big portion of all posts. We divide posts into original posts and reblogs, so the first cut is that we only index original posts. But even with original posts, if we look all the way back to 2007, there are a lot of them, and the best we could do was 600 machines. Back then, 600 machines for Tumblr as a startup was a huge number of machines. I mean back then, before we were acquired by Yahoo.

So we did a careful design and ended up with a solution where we build a three-tier index. The first tier we call recent: we take the last six weeks of original posts and index all of them. Then we look back four years and apply certain criteria, with popularity as the main criterion, to select the popular posts. Then we also have the old tag index, which is not that great, but it can still serve as a backup in case Solr fails. With this three-tier indexing scheme we were able to get down to 40 machines, which is a big cost saving.

We did some testing in terms of coverage. We took some random queries and went back to Hive to see what the difference would be if we had all the posts indexed, and the difference is really small. For most queries we have lots of content even in the last six weeks. For the really rare queries we go back to the four years of popular posts. If that still cannot serve the query, then either the posts you are looking for are not that great or we just do not have them.
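
One way to picture querying the three tiers is as a cascade: try the recent index first, widen to the popular-posts index when a query is rare, and fall back further if a tier is unavailable. The talk does not say exactly how the tiers are combined at query time, so the cascade logic, the `min_results` threshold, and the `TierUnavailable` exception in this Python sketch are all assumptions for illustration.

```python
from typing import Callable, List, Sequence


class TierUnavailable(Exception):
    """Raised by a tier that cannot serve the query (hypothetical)."""


# A tier is any callable that maps a query string to a list of post IDs.
Tier = Callable[[str], List[int]]


def tiered_search(query: str, tiers: Sequence[Tier], min_results: int = 10) -> List[int]:
    """Query tiers in order until enough results have been collected.

    Tiers are expected in priority order, for example:
    recent six weeks, popular posts of the last four years, tag index.
    """
    results: List[int] = []
    seen = set()
    for tier in tiers:
        try:
            hits = tier(query)
        except TierUnavailable:
            continue  # skip a failed tier and fall through to the next one
        for post_id in hits:
            if post_id not in seen:  # a post may live in more than one tier
                seen.add(post_id)
                results.append(post_id)
        if len(results) >= min_results:
            break  # common queries are satisfied by the first tier alone
    return results
```

In this sketch a common query never leaves the first tier, which mirrors the coverage observation above: most queries already have plenty of content in the last six weeks.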
Also, with this 40-machine setup we are able to serve up to 4,000 queries per second non-cached, which means going directly to the Solr cluster.

Another thing we do, because we are really trying to minimize the number of machines we use, is make the index very lean. We separate all the volatile signals from the index. For example, popularity keeps changing for a post, so we do not store it within the index, which eliminates a lot of re-indexing effort. It also gives us more independence in how we implement more signals, how we store them, and how we serve them. The other lean aspect is that we do not store any text information other than the index itself. The main advantage is that it shrinks the memory footprint of the index. Our goal is to fit as much of the index into memory as possible, because that dramatically increases performance and reduces latency. So those are the things we do for indexing.

As for ranking, right now it is still evolving very quickly. We have some decent ranking, but it is far from perfect. In production right now we use a small set of signals. One is global popularity, which includes the counts of, for example, likes and comments on a post, how many people reblogged the post, and how many people follow a blog. We also have a concept called local popularity. For example, in blog search we look at the query term: we look not only at how many total likes this blog has received, but also at only the posts that contain this term, and we aggregate the likes on that subset of posts. It is a kind of projection of popularity onto the query. We do something similar in personalized recommendation.
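
The difference between global and local popularity can be shown with a small Python sketch. The data layout (a blog as a list of posts with `text` and `likes`), the whitespace tokenization, and the tie-breaking rule are assumptions made for the example; the production ranking formula is not given in the talk.

```python
def global_popularity(posts):
    """Total likes across all of a blog's posts."""
    return sum(post["likes"] for post in posts)


def local_popularity(posts, term):
    """Likes aggregated only over the posts that contain the query term."""
    term = term.lower()
    return sum(
        post["likes"]
        for post in posts
        if term in post["text"].lower().split()
    )


def rank_blogs(blogs, term):
    """Order blog names by local popularity, using global popularity as tie-breaker."""
    return sorted(
        blogs,
        key=lambda name: (
            local_popularity(blogs[name], term),
            global_popularity(blogs[name]),
        ),
        reverse=True,
    )


# Toy data: "foodie" is more popular overall, "catlady" is more popular for "cats".
BLOGS = {
    "foodie": [
        {"text": "my lunch today", "likes": 500},
        {"text": "cats at lunch", "likes": 10},
    ],
    "catlady": [
        {"text": "cats cats cats", "likes": 40},
        {"text": "more cats", "likes": 30},
    ],
}
```

For the query `cats`, `rank_blogs(BLOGS, "cats")` puts `catlady` first even though `foodie` has far more likes in total, which is the point of projecting popularity onto the query.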

In personalized recommendation we look at the user as well: not only the overall follow count of a blog, but also how many of its followers come from your friends. This kind of locality, with respect to exactly which user and which query, improved relevance quite a bit.

Another important thing for follow recommendation is that we built an A/B testing framework. It allows us to put a small percentage of traffic on a given version of the recommendation, for example 1% of traffic. The main metric we look at is called the follow rate. If the follow rate of a new version is better than that of the previous version, we can see that it is a gain. But we also break our users down into different buckets, because sometimes an algorithm works better in one bucket than in another. We usually bucket users by how many people they currently follow. New users usually follow fewer people, and more senior users probably follow a lot, so they behave differently. That is how we measure our versions.

Textual relevance is also very important, especially for post search. We look at how your query matches within the post: is it an exact match, and what is the proximity of the different words? We also differentiate whether you match against a tag or against other text, because the tags of a post usually carry better relevance: people spend time trying to summarize the post with good words, so tags usually carry more weight. Recency is also important, for example in post search. So that is the ranking, and a big direction for us is to keep coming up with better signals and doing better ranking.

Now duplicate elimination. At Tumblr we have a lot of duplicate content. Some of it is because third-party tools allow users to reblog in a way that we cannot tell is a reblog, and some of it comes from users who simply copy content. They present it as an original post to gain popularity. So sometimes it is very embarrassing: you do a search and see lots of almost identical content. We spent some effort on eliminating those duplicates, and we do two kinds of duplicate elimination.

One happens at index time, using a post signature that mainly looks at the tags of a post. We also require the number of tags to be above a threshold, so that we do not kill too many posts that just coincidentally have the same tags. This eliminates a pretty big number of duplicates.

At search time we also look at the media hash. A lot of posts at Tumblr carry media, such as photo, video, or audio, and for all of them we have a hash. We have a way to match those hashes against each other, so we take advantage of that to do media-level deduplication.

Another thing we do is near-duplicate detection, because sometimes posts are the same except that people have added one or two tags or removed a few. We have a way to detect this with a certain precision. The thing is, we have so much great content that it gives us a big space: even if we make some mistakes, we still have great content to show, while the cost of not detecting duplicates may lead to a worse user experience. So right now we tend to be a little more aggressive in deduplication.

Next, the search platform. This is about Solr, Elasticsearch, and SolrCloud, and there is probably a lot of debate about which one is better. What I want to do is just tell you what we tried. At Tumblr we actually started with Elasticsearch. Otis was helping us solve some of the problems, but for quite some time there were many issues with Elasticsearch, in particular with cluster forming. How a cluster forms was very much a black box.

You get very little information about what is going on. When it works, it is great, but sometimes when you start a cluster it just does not form and enters a so-called yellow state. It is not red, but it is not green, and it does not work. Our engineers would spend a lot of time digging in just to understand what was causing this. Even with all that expertise we were not able to get to the bottom of it.

So we switched to SolrCloud, because SolrCloud also offers very good functionality: clustering and distributed search. It worked pretty well until we put a lot of load on it. We tried to simulate our traffic load by doing a lot of indexing and a lot of search at the same time, and then we started to see problems. The major problem we observed, and for which we do not have a good solution, is this: we use a lot of replicas to offload the search traffic, and we found that one slow replica can slow down the whole indexing of the same shard. I dug into the SolrCloud code a bit, and it seems the leader keeps a certain buffer. If one replica does not acknowledge, then after the buffer reaches a certain maximum size the leader waits and stops sending to the rest of the replicas. Because every second we have thousands of posts to index, we easily build up a backlog, so the freshness of search results is significantly delayed. Again we tried to solve this problem, to no avail.

We had deadlines to meet, so we ended up going back to plain Solr and wrote our own customized cluster management, and it has been running just fine; we do not really see any problems. Of course at some point we still want to go back and review and test SolrCloud, once the glitches have been ironed out, but as of right now we are staying with basic Solr plus some of our customized code. So that is our take.

Functionality-wise I really like the design of Elasticsearch and SolrCloud and how easy they are to use, but plain Solr just seems to be a lot more reliable in a production environment like ours, where we have a lot of indexing and a lot of search at the same time.

The last thing I want to mention is one lesson we learned at Tumblr. We have lots of data and a lot of traffic, and it seems the best way to meet both ends is this: if the search is not time-sensitive, do a lot of precomputation. This really gives you the resources to do more sophisticated processing or analysis, and you do not have the pressure of "you only have 500 milliseconds, go." We do not have that pressure in offline time. Of course the limitation is that some applications are real-time. Post search, for example, is very recency-heavy, and you have to be up to date to the second: when users hit post, they expect their posts to show up within a second. So it does not apply there. But many other search applications are different. Typeahead does not change much; the list of searches is stable. Blog recommendation, again, does not change by the second such that you would want to be looking at a different blog. For all that less time-sensitive data we do a lot of precomputation using our powerful Hadoop cluster. That is one important lesson we learned doing Tumblr search.

What is next for Tumblr search? One big thing we want to do is what we call in-blog search. Right now, if you go to any Tumblr blog, the only thing you can search is, similar to the old post search, one single tag, and internally it is again a massive MySQL table indexed by tag ID. What we want to do is provide functionality so that, given any blog, you can search not only its original posts but possibly also its reblogs and likes.

The difference between in-blog search and regular search is that it only searches within one blog, which should make a lot of things easier, because when you index you can separate the data of different blogs. It is like in Lucene, where essentially different fields are used and, in the binary data format that gets generated, different fields are totally independent, so retrieval can be really fast.

Another big thing for us is to improve the ranking. We have already started seeing some gaming and some spamming, people trying to game popularity in the search results. So we want to devise a set of more spam-resilient and more effective signals. And once we have more signals, a handcrafted scoring function is not going to be that great, so we are also trying to apply machine learning, using the signals plus all the user feedback, to rank results better.

On the recommendation side, right now we mainly rely on collaborative filtering to generate the recommendations. One thing we want to do is go more directly: given the user and given the content, can we model the topics directly, so that we can use topics to match users and content by understanding what their interests are? It could be supervised or it could be unsupervised, but essentially we are trying to do it based on interest, and that would be a good complement to what we are doing right now.

The last thing is content discovery. We always have more content than users can consume. So how do we detect really interesting content and let users know about it, so they can better appreciate that Tumblr is a big content network and see what content they might be interested in? That is a very big and ongoing effort, and we want to do better on content discovery.

So that is pretty much my talk. If you have any questions I would be happy to answer them. Before that, as usual: we are hiring. If you are interested in working with all this awesome content, welcome to search; please talk to me or one of my teammates.

Yes? So the question is about schema, how we index. Right now the main data we have are posts and blogs, and each of them has a schema. For posts the schema is really simple. Tags are one very important field; we also have title, body, and caption. For right now we actually only differentiate between the tags and the rest of the text. For both we do some basic text analysis, like stop-word removal and stemming, using the Porter 2 algorithm for stemming. Then we pretty much just put it into Solr, which creates the inverted index. So that is how we index them. We do not really have a very complex schema.

Yes? Let me repeat the question, because I think this is being recorded. The question is: when we moved from one platform to another, did we try different versions of each platform? Is that correct? We actually stuck with Elasticsearch for a while.

We stayed with it for quite some time. Some features built on Elasticsearch were actually in production, live, and our users were using them. That is why we had a lot of pressure to keep it alive and running, and that is why we brought in Otis and his team to try to really make it work, because we really like Elasticsearch's functionality. But unfortunately, even with a lot of effort, the version available back then just did not work for our particular application. Almost every other week we had a major failure, which would cost a few hours of downtime for that service. That is very serious, and there seemed to be no easy solution that would get rid of it with confidence, where we could say: we know the root cause and here is the fix. We could not. For us, we want predictability. We want the service to be available, and even if something goes wrong, as long as we know the root cause we are comfortable. But with Elasticsearch too much magic happens inside, so we decided to go with SolrCloud, which definitely provides more transparency, so we can see what is going on.

Yes? Can you repeat that? So the question is how much cost played into the decision to choose Solr over the other choices. We pretty much did not see any cost increase, for example in the number of machines, and performance-wise they are comparable, I think. So cost was not a major factor in the decision. Reliability is really the thing that eventually led us to Solr.

That is a good question. In search we think of ourselves as pretty much language-agnostic. For example, in the workflow we support different languages. The key thing is that we try to build a good framework so that, at a very high level, it can support jobs written in different languages. For example, online we use Thrift as a bridge. From PHP, it lets us talk to different services written in, for example, Java or Scala. But I think there is also a cost: we do see redundancy, with things being written more than once. So it is kind of a trade-off. In the beginning people come from different backgrounds and we need to get things done, so we were willing to let that happen. But once we have a team and are building a lot of things within a company, it is probably a good idea to consolidate into one or two languages, because then you can reuse a lot of things. Especially for offline, one thing we want to try is moving more towards Scalding, because it has a lot of great properties, especially in terms of reusability, and its modularity seems pretty good for a bigger team working together.

Yes? That is definitely a very good question. For search, the main thing we measure is engagement, though this actually started only with the new search. But Tumblr is a great social network, so a lot of what we learn comes from our users. Our users are very outspoken: whatever they do not like, they put up on Tumblr. So a lot of the time when we release a new feature, we just go to Tumblr and see how our users respond to it.

respond to features and of course we are prepared user tends to be whoever wrote post a little more probably out towards a little more negative to the new features but we we do see a lot of appreciation especially without that we take the UI initially because we also into this UI in the this grid format which which is great I personally like that more than the list because it allows you to look more and in the search results but a lot of people they like the list format it so well so they actually kind of like have a lot of negative comments but one as soon as we give them a choice they can either using grid are using the list most people are very very positive about new search functionality can now they can search Nolan just single type mode Italia they are also search like full tags so I think from that is its big win and at same time the search volume is increased but that’s less difficult because really when you have a better search sometimes you might see not necessary your search volume would jump because they might actually in the first search in the first page you already find out what they want so that’s kind of like double side but a better better measurements would be engagements which you right now we starting to report daily but unfortunately we don’t have men in that number before so moving forward what we want to do is for the search we also want to do like look emendation want to view this a bee testing framework so that we can see between different release hub are we moving this kind of matrix like two people like more often in the search results they follow other people’s blog more often things like that yeah okay I’ll probably go this side for yes yes to be honest this stain was so the question is about the serialization this particle where we choose stripped against other available was the quick answer is this decisions be made even before John publish yeah I think it works whatever it works is good yeah please yeah I didn’t get a second part of 
your question. All right, so the question is how we decide what to index and what not to index. As I mentioned earlier, right now the criterion is mainly popularity. We don’t look at the type of content; we don’t try to police the content and say, “OK, this is bad content.” We try to index the content and leave the choice to the user. We do have an option for users to select safe search, so you see content that is not necessarily better, but safer; and if you want to see all kinds of content, you can see that too. The choice is mainly for engineering and disk-space reasons. So we use popularity, which means that if people like your post more or reblog it more, there is a bigger chance we will index it. And if we had really unlimited resources, at some point we might just index all of them.

Yes, please? Right. Oh, sorry, I should repeat the question. The question is what the replication scheme is, and whether it is within a data center or across data centers. The answer is that right now at Tumblr we only have one data center, so the solution is within one data center. But I think it can also be applied to multiple data centers easily, by replicating at the firehose level. Basically, different consumers can consume the identical stream in different data centers and build the same index, because as long as the input is the same, the index should be the same.

Yes, please? To be honest, I think MySQL works pretty well, and I don’t really have much experience with the other databases you mentioned. But we have a pretty good database team at Tumblr, and they have a lot of expertise in making MySQL work at scale. We have a huge MySQL database, because we have such massive content. And again, this decision was made long before I joined Tumblr. At a certain point we might start to review whether MySQL is still working for us, because we may really get to a point where we need a better technology. But right now it works pretty well, so we haven’t gotten to the point of changing that yet.

I’ll go to this side and take one question. Yes, great question. So the question is about photo analysis. [inaudible] We don’t do that; we just haven’t gotten to it yet. As I said, there are a lot of things on our to-do list, and I’ll probably add that one too. Thank you.

Photo analysis is actually a very interesting area, and recently there have been a lot of acquisitions in this domain. Yahoo, and we are now part of the big Yahoo family, has quite a few recently acquired companies with specialized technology for analyzing photos: for example, detecting simple things like text, but also trying to annotate the entities in the photo. Yahoo Labs has something which is, basically, of
course, still research, in the early stages, but for most celebrities, given a photo, it may be able to identify them with pretty good precision. That matters because a lot of photo posts at Tumblr have no tags or very few tags; some people tend to publish good content without annotating it. So if we can use additional photo analysis to get more information, that can help search a lot.

Yes? So for untagged content, if it contains some other text, it becomes less of an issue. But if, for example, the only thing the post contains is a photo, then there are a couple of ways. One way is to analyze the photo itself. The second is to find out where the photo comes from. Some people do put a source for where the photo comes from, so one option we are looking at is to go back and crawl that page to find the surrounding text. So that is another approach; essentially it is getting more text for the photo.

Yes, please? So the question is, when we reindex, how do we swap in the new index. For the post index, the only time we reindex is when a post has changed. That is essentially built into Solr: it handles the new content, because it will actually delete the old one and overwrite it. So that is not something we are particularly worried about. The only other time is when we have to recycle a whole core, because, as we said, we keep the most recent six weeks. So when a week becomes the seventh week, we use Solr’s core admin commands. On the indexer side we have a policy to see when a core becomes out of date, and we delete it dynamically.

Yes? Yes, then we replicate them across different servers. Right now [inaudible] every week we create a new core; that is what we consider a shard.

For replication, as I said, we are using plain Solr; again, our indexer manages the replication. The way we do it is that we split the queue: once we read a post from our firehose, we multiplex it to the different replicas. Each of them is identical to the others, but they don’t have any communication between them, so that gets rid of the complexity. Yes? Yes, and this is all controlled by the indexer. So everything is something we can control: we know exactly how the replication and the sharding happen, and if something goes wrong we can find the root cause and fix it.

I’ll go to this side for one question. On the client side, for the typeahead, we do cache: we cache for the session. But because our typeahead is really fast, we don’t do anything beyond that. So pretty much we only do very basic caching.

Yes? [inaudible] Oh, OK. For indexing, I don’t remember the exact number, but in general it’s not very big, because we have so much content. The percentage, I think, is at most 5%, if I have to give a number. But even 1% of the posts is going to be huge. Also, what really matters is search time: are they appearing in the search results or not? You can have billions of pieces of content; as
long as they don’t show up in search, we don’t care. That is why we do a lot of search-time deduplication, because that is really the key: we don’t want users to see duplicates when they search.

Yes, please? So the question is how we take sharding into account when we rank data. Right now the scheme is this: because we shard the data by recency, and because recency is also important in ranking, when retrieving we usually fetch the more recent data first. Only when the recent data is not enough do we go to what we call the top popular partition and fetch additional results. The ranking is pretty much the same: we combine recency, textual relevancy, and popularity.

Not right now, but categorization is one thing on our roadmap. There are quite a few things we want to add to search. First, we want to do a better job of analyzing the query: detecting entities, and also what type of entities they are, because different types of entities might need a different ranking or retrieval strategy. Right now there is no learning; it’s all handcrafted. But we are looking at things like decision trees to learn the ranking function.

Yes? Two more questions, because I think we have to close the office. So the question is, in A/B testing, what bucketing really means, and whether we are doing the right thing when we bucket. I think I probably misled you a little bit. When we do A/B testing, the first thing is that we put users into percentages, and that slicing is pretty much randomized; we are just trying to get a very even distribution across our whole user base. But once we get results and do the analysis, we not only look at the overall number for the users we gave the new version to; we also break down the users who saw the new version into different buckets based on certain criteria, so that we can better understand how different kinds of users respond to it.

OK, all right, last question. Yes, if you think of each tag as its own category, then we have something, but we don’t really differentiate that much right now. For categorization, we are starting to work on trend detection, so we are trying to categorize the trends. But that’s pretty much it; we’ll do more on that side.

I just want to let you know there is a great bar close to the company, Harding’s, and some of my teammates, Adam over there and [name unclear], are going to go to the bar. If any of you are interested, you can find them there. I have to go home to look after the kids, sorry. Thanks again for coming, and I hope you enjoyed it.
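
The A/B testing answer above describes two steps: users are first placed into percentage buckets in an essentially random, evenly distributed way, and at analysis time the users who saw the new version are further broken down by some criterion. A minimal Python sketch of that idea follows. It is not Tumblr's implementation: the hash-based assignment (a deterministic stand-in for random assignment), the function names `assign_bucket` and `engagement_by_segment`, and the segment labels are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def assign_bucket(user_id, experiment, treatment_pct):
    """Place a user in 'treatment' or 'control' (hypothetical helper).

    Hashing the experiment name together with the user id gives a stable,
    roughly uniform position in [0, 1), so the same user always lands in
    the same bucket and buckets are spread evenly over the user base.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest[:8], 16) / 0x100000000  # first 32 bits -> [0, 1)
    return "treatment" if position < treatment_pct else "control"


def engagement_by_segment(events, segment_of):
    """Compute the engagement rate per (bucket, segment) pair.

    events: iterable of (user_id, bucket, engaged) tuples, where engaged
            is a bool (for example, "liked or followed from search").
    segment_of: function mapping a user_id to a segment label; the
            criterion itself is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [engaged count, seen count]
    for user_id, bucket, engaged in events:
        key = (bucket, segment_of(user_id))
        totals[key][0] += int(engaged)
        totals[key][1] += 1
    return {key: hits / seen for key, (hits, seen) in totals.items()}
```

Calling `assign_bucket(user_id, "new_search", 0.1)` would expose roughly 10% of users to the new version, and `engagement_by_segment` can then be run over logged events with whatever segmentation criterion is of interest.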