[Xen-devel] Project discussion log
<vr34> Hi Jesus
<jgbarah> Hi!
<vr34> So, let me summarize what I have understood of the project?
<vr34> i am able to run perceval on the mbox link given
<vr34> i need to use perceval.backends.mbox while using it in a script, right?
<vr34> also, to get archives analyzed - on what basis is this done?
<vr34> also could you suggest a step-by-step approach for this microtask?
<jgbarah> Well, now it is perceval.backends.core.mbox, if I'm not wrong, but yes, right
<jgbarah> The idea is: when you analyze with Perceval, you get a JSON document per message.
<jgbarah> Those documents will be uploaded to ElasticSearch.
<vr34> all documents to the same index?
<jgbarah> Once they are in ElasticSearch, they will be annotated by thread
<jgbarah> (Yes, all documents to the same index)
<jgbarah> There is a well known algorithm for annotating the threads, we will use it
<jgbarah> The annotation could be done directly before uploading to ElasticSearch, but that has problems,
<jgbarah> such as that in many cases, threads span several archive files
<jgbarah> So it is better to first upload to ES, then retrieve the index and analyze
<vr34> Oh okay..
<jgbarah> For uploading / downloading to ES, you can use elasticsearch-dsl, a Python module for ES
<jgbarah> For uploading, you can try with uploading document by document, but if possible, use the bulk mode
<jgbarah> (there is a helper module provided by elasticsearch-dsl for that)
<vr34> sure, got it
<jgbarah> For downloading, you could get document by document, maintaining state in your program while you annotate
<jgbarah> And uploading in batches, once you're done (or just use the bulk helper with a Python generator)
<jgbarah> The result should be a thread id for each message, which should be always the same.
<jgbarah> For example, it could be the unique id of the first message in the thread
<vr34> is this like a primary key for each of the messages?
<jgbarah> (I mean, the Message-ID of the first message in the thread)
<jgbarah> Each email message should have a Message-ID field, which should be unique. That one.
<vr34> okay
<jgbarah> Example: Message-id: <6c195e50-0fae-5008-4f34-df5bc7231d38@xxxxxxxxxxxx>
<jgbarah> Maybe more clear now?
<vr34> Yes, a lot clearer now!
<vr34> thanks a lot
<jgbarah> Great! You're welcome!
<vr34> regarding the microtask
<vr34> could you suggest what i could work on everyday?
<vr34> i can start with generating the json file output from perceval
<vr34> and dsl today
<jgbarah> Yes, please. You can organize as you may want, and I will be happy to receive your progress messages every day, or when you feel comfortable
<jgbarah> The pace depends on you.
<vr34> okay, sure.
<jgbarah> I would start by writing a simple script parsing a file, given its url (or its file name, if you prefer)
<jgbarah> You have an example in the GrimoireLab training manual
<jgbarah> Then, I would improve the script to upload the documents to ES
<jgbarah> Then, I would write a script to download documents, annotate each with anything, and re-upload them to a new index
<jgbarah> Just to become familiar with ES and elasticsearch-dsl, and if possible with the bulk mode
<jgbarah> Then, I would improve that script to run the threading algorithm
<jgbarah> And when you're done with that, you're done ;-)
<vr34> Ah okay!
<jgbarah> Anything else?
<vr34> That's about it! Thank you so much!
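
The annotation step discussed above (a thread id per message, equal to the Message-ID of the first message in its thread) can be sketched roughly as follows. This is only a simplified illustration, not the well-known threading algorithm mentioned in the chat: it just walks `In-Reply-To` links up to the root. It assumes each message is a plain dict carrying `Message-ID` and, optionally, `In-Reply-To` header fields. The function names, the `thread_id` field, the index name `mbox_threads` and the document type `message` are all hypothetical, chosen for the example.

```python
def annotate_threads(messages):
    """Map each Message-ID to the Message-ID of the root of its thread.

    Simplified sketch: only In-Reply-To is followed. A message whose
    parent is not among `messages` is treated as the root of its thread.
    """
    # Child -> parent links, taken from the (assumed) In-Reply-To field.
    parent = {msg["Message-ID"]: msg.get("In-Reply-To") for msg in messages}

    def find_root(msg_id):
        seen = set()
        while True:
            up = parent.get(msg_id)
            # Stop at a message with no known parent. The `seen` check
            # guards against malformed reply loops so the walk terminates.
            if up is None or up not in parent or up in seen:
                return msg_id
            seen.add(msg_id)
            msg_id = up

    return {msg_id: find_root(msg_id) for msg_id in parent}


def bulk_actions(messages, thread_ids, index="mbox_threads", doc_type="message"):
    """Yield one action dict per message, annotated with its thread id.

    The dicts follow the action format used by the bulk helper of the
    Python Elasticsearch client; index and doc_type are hypothetical.
    """
    for msg in messages:
        doc = dict(msg, thread_id=thread_ids[msg["Message-ID"]])
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": msg["Message-ID"],
            "_source": doc,
        }
```

With a client object at hand, the generator could presumably be fed to the bulk helper, for instance `elasticsearch.helpers.bulk(client, bulk_actions(messages, annotate_threads(messages)))`, which matches the "bulk helper with a Python generator" suggestion. Because the whole index is annotated at once, threads that span several archive files still resolve to the same root.
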
<jgbarah> Oh, I forgot to mention: this should work with Python3 and ES 5.3, if possible
<vr34> Right
<jgbarah> For Perceval, use the latest version available via pip
<jgbarah> Very likely there is going to be a new one during the weekend
<vr34> Oh okay, i will update it to the latest version
<jgbarah> And a final note: for transparency and reference, please get the log of this session, and send it to the xen mailing list
<vr34> another question
<jgbarah> copying Lars and myself
<jgbarah> Yes please
<vr34> Sure
<vr34> would we be using logstash in this project for parsing logs? i didn't see any mention of it anywhere
<jgbarah> No, we use Perceval to parse (mbox files in this case), and then upload directly with Python
<jgbarah> You can say that you're writing your own LS ;-)
<vr34> haha okay!
<jgbarah> Nothing else on my side. Anything else from you?
<vr34> do i give updates on irc/mail?
<jgbarah> Please update by mail, since that works asynchronously, and ping me on irc whenever you find me, if you need it
<vr34> Sure, thanks a lot!
<jgbarah> We can schedule irc slots when you need.
<jgbarah> Thanks to you for your interest in this project
<jgbarah> See you!
<vr34> and thanks for helping me contribute! See you !

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel