
Howell - Anthony Fauci's corona virus emails, code for text [rendition, analysis]

Classical wisdom : "...   If it ain't broke, don't fix it.   ..."
My practice :         "...   If it ain't broke, break it.   ..."
Trust me(???!!!) : Don't try this at home, leave it to the idiots...




Summary

Quick & Dirty software was thrown together 25May-30Jun2021 to convert emails in a pdf file to text format, suitable for import into email applications (eg [Thunderbird, Evolution, etc]) that use mbox format. The initial date precedes the date when I first found a copy of the Fauci emails pdf file online.

The main challenge was to address seemingly random [insert, delete]ions of spaces throughout the transformed text. The Adobe Acrobat software probably already avoids this problem altogether. So you might ask : How could I [be so stupid, spend so much time] by doing something so useless? Please let me know when you figure out the answer - I could use the help.

The next stage of software development, to [process, analyze] the emails, has only been touched on, and will be posted at a later stage of development.



Background : Informed Consent Action Network (ICAN)

Informed Consent Action Network (I had never heard of them before) :

"... ICAN OBTAINS OVER 3,000 PAGES OF TONY FAUCI’S EMAILS
Jun 04, 2021, 03:53ET
Last year, ICAN made FOIA requests to NIH for documents regarding COVID-19, including two requests for Anthony Fauci’s emails. ICAN has received nearly 3,000 emails sent by Fauci from early February 2020 through May 2020. Read what Fauci was saying privately about masks, therapeutics, vaccines, ventilators, and many other COVID-19 topics.

On April 10, 2020 and May 5, 2020, respectively, ICAN submitted the following two FOIA requests: When NIH failed to respond to those requests, ICAN brought a lawsuit against the agency on June 29, 2020. In response, NIH agreed to produce Fauci’s emails on a rolling basis. To date, we have received 2,957 pages of Fauci’s sent emails dated between early February 2020 through May 2020 and will continue to receive email productions on a rolling basis.
Read Fauci’s emails here and a few highlights from these emails are outlined below: ..."

(Howell : I haven't listed them here. Just click the link)

Of course, many other lists of questions can be found in amateur blogs, and once in a while good comments can be found on alternative online media. Most online material by amateurs is short-of-inspiring, but the "less-than-one-in-ten-thousand" certainly outshine the professionals. Maybe I'm too cynical, but it seems that it's rare that the mainstream media does much beyond catering to the politically-correct beliefs of their [sponsor, subscriber, public mob, university]s. So you can't be lazy : actively seek out good stuff for issues that are important to you, or sit back and absorb the mainstream crap. But most important of all, do your homework, think critically, and don't just take the words of experts.

I found the emails ~??Jun2021 in a pdf document that was posted by Jason Leopold. See also the article by Natalie Bettendorf, Jason Leopold 01Jun2021 "Anthony Fauci's Emails Reveal The Pressure That Fell On One Man".


Email processing - overall steps

The one-liner descriptions below are taken directly from the "List of operators" of the two key QNial programs for this project. As such, they are an ACTUAL meta-level description of the processing of the pdf email compilation. I admit that this is a horrible excuse for a description, but I've done worse. If I get really excited, I might even do a better job of it.


*********************
loaddefs link d_Qndfs 'emails - convert pdfCompilation to text.ndf' - convert email pdf to text
+-----+
convert pdf of emails to text, initial clean up
pdf_convertTo_txt IS OP pinn pout - convert pdf file to text
pEmails_1stclean_pout IS OP pinn pout -
getHead_from_lines IS OP fout - build email header when From: line has content
getHead_from_blankLines IS OP finn fout - build email header when From: line is empty
pEmails_fixHeads_pout IS OP pinn pout - produce unified, coherent email headers
pEmails_fixBodys_pout IS OP pinn pout - clean up [mis-spaced words, junk-infested lines]
+-----+
reduce word corruption due to [add, remov]ed spaces by pdftotext, add intro
nDat_indxsSumToNdat_get_ij IS OP numL num - returns indices of numL, sum(numL@(i j)) = num
nDat_indxsSumToNdat_get_ij_test IS - res ipsa loquitur
pEml_addIntro_pout IS OP peml pout - add introduction to pEmailsRaw
+-----+
Extract contentTypes from p_emailsClean
pEmails_get_pContacts IS OP pinn pout sedExpr title introL - sorted list of [To,From,CC] contacts
pEmails_get_pSubjects IS OP pinn pout sedExpr title introL - extract [To,From,CC] subjects
pEmails_saveTo_dMboxDirsPaths IS OP pinn dout - extract each email as a separate file in dEmails
assumes that dEmails already exists - overwrites what is already there
+-----+
Do it all...
pEmails_doALL IS - res ipsa loquitur
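
A minimal Python sketch of the final "save to mbox" step, for readers who don't use QNial. This is NOT the operator pEmails_saveTo_dMboxDirsPaths, just an illustration using Python's standard mailbox module. The function names, the sample header values and the file name are all hypothetical, and it assumes that each email has already been separated into a header dictionary and a body string.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def append_email_to_mbox(p_mbox, headers, body):
    """Append one email (header dict + body text) to the mbox file p_mbox."""
    msg = EmailMessage()
    for name, value in headers.items():
        msg[name] = value
    msg.set_content(body)
    box = mailbox.mbox(p_mbox)  # creates the file if it does not exist yet
    try:
        box.add(msg)
    finally:
        box.close()  # close() also flushes pending changes to disk


def count_emails(p_mbox):
    """Return the number of messages stored in the mbox file p_mbox."""
    box = mailbox.mbox(p_mbox)
    try:
        return len(box)
    finally:
        box.close()


if __name__ == "__main__":
    # hypothetical sample data, not taken from the pdf
    p_demo = os.path.join(tempfile.mkdtemp(), "demo.mbox")
    append_email_to_mbox(
        p_demo,
        {"From": "sender@example.org", "To": "reader@example.org",
         "Date": "Mon, 03 Feb 2020 10:00:00 -0500", "Subject": "demo"},
        "Body text recovered from the pdf.\n",
    )
    print(count_emails(p_demo))
```

The resulting file can be copied into the mail folder of an mbox-based application such as Thunderbird.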


*********************
loaddefs link d_Qndfs 'dictionaries.ndf'
+-----+
[create, process] dictionaries
urlL_make_pDic IS OP urlL pdic - create a wordL from a list of urls
pdwd_pdic_extract_pDif IS OP pdwd pdic pDif - create a specialized dictionary not in old dictionaries
pDicInn_removeApoLines_pDicOut IS OP pDicInn pDicOut - remove lines with apostrophes (apos)
pDicL_catSortUnique_pDic IS OP pDicL pdic -
+-----+
repair p_text using dictionaries (eg pdf files)
pdic_pWrdSorted_merge_pDicWrd IS OP pdic pWrdSorted - [merge, sort] prepTo fix [split, merged] words
fragL_subFragL_getNonNull IS OP nFrag_fragL_subFragL - file read of variables
pTxt_pDic_extract_pfrag_pFragSubs IS OP ptxt pdic pfrg psub -
pClean_replace_pFragL_pSubFragL IS OP pFragL pSubFragL - file read of variables


*********************
The operator "" listed above is worth looking at in more detail. Write statements (output to the terminal screen) help here :
write link timestamp_YYMMDD_HMS '-> generate a sorted list of wordFrags' ;
write link timestamp_YYMMDD_HMS '-> merge pdic and pwrd, sort to pdwd' ;
write link timestamp_YYMMDD_HMS '-> read pdwd, - write [, sub]FragL to prawe' ;
write link timestamp_YYMMDD_HMS '-> read praw, build rawe[, Sup]FragL' ;
write link timestamp_YYMMDD_HMS '-> "invert" raweSupFragL to findSubFragL' ;
write link timestamp_YYMMDD_HMS '-> "right lengthPairs"' ;
write link timestamp_YYMMDD_HMS '-> strL_write_pout [, sub]FragL' ;
write link timestamp_YYMMDD_HMS '-> strL_write_pout' ;
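
The "merge pdic and pwrd, sort" step in the write statements above can be illustrated with a short Python sketch. This is only a guess at why a merged, sorted list is useful, not a transcription of the QNial code: in a sorted list, a word fragment sits immediately before every word that starts with it, so candidate completions can be found by scanning forward. The function names and sample words are hypothetical.

```python
import bisect


def merge_and_sort(dic_words, txt_words):
    """Merge dictionary words and text words into one sorted, unique list."""
    return sorted(set(dic_words) | set(txt_words))


def prefix_completions(fragment, merged_sorted, dic_set):
    """Return dictionary words that start with fragment.

    All strings sharing a prefix are contiguous in a sorted list, so the
    scan starts at the fragment's position and stops at the first word
    that no longer starts with it.
    """
    start = bisect.bisect_left(merged_sorted, fragment)
    found = []
    for word in merged_sorted[start:]:
        if not word.startswith(fragment):
            break
        if word in dic_set and word != fragment:
            found.append(word)
    return found


if __name__ == "__main__":
    dic = ["vaccinate", "vaccine", "virus"]   # stands in for pdic
    txt = ["vac", "cine", "virus"]            # stands in for pwrd
    merged = merge_and_sort(dic, txt)         # stands in for pdwd
    print(merged)
    print(prefix_completions("vac", merged, set(dic)))
```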



Proper word lists

(slang term : dictionaries, I shouldn't use this phrase)

Proper word lists play an important part in the text repair process.
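
As a rough illustration of how a word list helps, here is a minimal Python sketch that rejoins words split by stray spaces. It is a simplified stand-in, not the method used in 'dictionaries.ndf'. The function name, the max_parts limit and the sample words are hypothetical.

```python
def fix_split_words(tokens, word_set, max_parts=3):
    """Rejoin runs of adjacent tokens whose concatenation is a known word.

    A token that is already a known word is left alone. Otherwise the
    longest run (up to max_parts tokens) that forms a known word is joined.
    """
    out = []
    i = 0
    while i < len(tokens):
        joined = None
        if tokens[i].lower() not in word_set:
            for n in range(max_parts, 1, -1):
                candidate = "".join(tokens[i:i + n])
                if i + n <= len(tokens) and candidate.lower() in word_set:
                    joined = (candidate, n)
                    break
        if joined:
            out.append(joined[0])
            i += joined[1]
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    words = {"the", "vaccine", "is", "ready"}
    print(" ".join(fix_split_words("the vac cine is re ady".split(), words)))
```

One known weakness of this sketch : when the first fragment is itself a proper word (eg "in formed"), it is left untouched, so such cases still need other handling.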


pTxt_pDic_extract_pfrag_pFragSubs
I need a list of common [begin,end]ings of words
[adjective, noun], [adverb, verb], [conjunctive, prepositions]

Flow of word lists :
[Clean dictionaries, Dirty word list] --> diff --> frags --> [breakup, recombine] --> [fixBlends, fixSplits]
[Clean dictionaries, Manual collection of [good, new] words, [url, document] sources of vaccine words] --> diff --> vaccine wordList

diff :
- perhaps cut off <2 chrs?
- can always search smallFrags later
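
A minimal Python sketch of such a "diff" with a length cut-off, assuming plain lists of words as input. The function name and the default cut-off are hypothetical. Short fragments are returned separately instead of being thrown away, so that they can still be searched later.

```python
def diff_words(dirty_words, clean_words, min_len=2):
    """Split dirty words that are absent from clean_words into two sorted lists.

    Returns (frags, small_frags) : frags have at least min_len characters,
    small_frags are the shorter left-overs.
    """
    clean = {w.lower() for w in clean_words}
    unknown = {w for w in dirty_words if w.lower() not in clean}
    frags = sorted(w for w in unknown if len(w) >= min_len)
    small_frags = sorted(w for w in unknown if len(w) < min_len)
    return frags, small_frags


if __name__ == "__main__":
    print(diff_words(["vac", "cine", "a", "x", "virus", "vac"], ["virus", "a"]))
```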



[url, document] sources of vaccine words, 28Jun2021 starting point :
CDC ???


Software [language, utility, tool, header, program]s

A hybrid [QNial, bash] environment was used to write the programs.

I've extracted text from pdf files many times. Perhaps the biggest challenge I've taken on was to automate the editing of thousands of pdf files (keeping them in pdf format) for either the "World Congress on Computational Intelligence (WCCI)" or the "International Joint Conference on Neural Networks (IJCNN)" (see Authors' Guide and software). As usual, many pdf files had to be iteratively corrected with the authors' feedback. Nobody can be expected to get all of the formats right, and the authors don't have some of the required material (eg copyrights, etc).
Several operators (same as [function, procedure]s in other computer languages, short term I use is "optr") were specifically developed for this project, but they were put into generic form in other files for use in other projects, including :

strings.ndf : fileops.ndf : 'QNial setup.ndf'
My core QNial (.ndf) library files are, as usual, used extensively.

Notes taken during development : my notes are not of much use to anyone else, other than as proof of my stumbling in the process. Still, rare parts of them may help others.




[small, fun] stuff I've learned


sed expression wrapping for [view, correct, remind]ing

As an [old, fat, bald] guy, my [eye, finger, ear]s are fine, it's just what's between them that is failing. Short [grep, sed] expressions are like buttering a slice of bread, long expressions are like being lost in acres of thorn bushes. It was a relief to put in the tiny effort to make them more [read, edit]-able. Here are two examples :

#] sed_getContacts - reformat email headers [Date, To, From, Cc, Subject], pEmails_get_pContacts

tbl :=
'extract lines of interest'.......'s/^From: //I;s/^To: //I;s/^Cc: //I'
'get rid of multiple spaces'.....';s/[ ]\+/ /g'
'get rid of spaces within ()'....';s/(\(.*\) \(.*\))/(\1\2)/g'
'get rid of apos'.................(link ';s/' chr_apo '//g')
'get rid of quotes'..............';s/"//g'
'problematic lineStart1'.........';s/^ //g'
'problematic lineStart2'.........';s/^.+//g'
'firstname tighten'..............';s/ , /, /g'
'lastname tighten'..............';s/, ((/, (/g'
'delete title for alphaSort'.....';s/Dr. //'
;

n_cols := 2 ; n_rows := (floor ((gage shape tbl) / 2)) ;
sed_getContacts := link second cols (n_rows n_cols reshape tbl) ;
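
The same "description beside expression" idea can be sketched in Python for readers without QNial. This is an illustration only, not a port of sed_getContacts : it covers a few of the rules, uses Python regular expressions instead of sed syntax, and the names CONTACT_RULES and clean_contact are hypothetical.

```python
import re

# (description, pattern, replacement) : the description documents each rule,
# only the pattern and replacement are used when cleaning
CONTACT_RULES = [
    ("strip header keywords", r"^(From|To|Cc): ", ""),
    ("get rid of multiple spaces", r" +", " "),
    ("get rid of apos", r"'", ""),
    ("get rid of quotes", r'"', ""),
    ("problematic lineStart", r"^ ", ""),
    ("firstname tighten", r" , ", ", "),
]


def clean_contact(line, rules=CONTACT_RULES):
    """Apply each rule in order to one header line and return the result."""
    for _description, pattern, replacement in rules:
        line = re.sub(pattern, replacement, line, flags=re.IGNORECASE)
    return line


if __name__ == "__main__":
    # hypothetical header line, not taken from the emails
    print(clean_contact('To: Doe , Jane  "JD" (NIH/OD)'))
```

Keeping the rules in a list means a rule can be [add, remov, reorder]ed by editing one line, which is the same benefit as the two-column QNial table.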


#] organization acronyms, pdftotext
# frequent mis-recognitions : (=[({] )=[)}]

sed_orgAcronym :=
'(NIH/CC/DLM)'....';s/[({]N[1IJlTf]H\/CC\/[D0]LM[)}]/(NIH\/CC\/DLM)/'
'(NIH/FIC)'.......';s/[({]N[1IJlTf]H\/F[1IJlTf]C[)}]/(NIH\/FIC)/'
'(NIH/NCI)'.......';s/[({]N[1IJlTf]H\/NC[1IJlTf][)}]/(NIH\/NCI)/'
'(NIH/OD)'........';s/[({]N[1IJlTf]H\/[O0][D0][)}]/(NIH\/OD)/'
'(NIH/VRC)'.......';s/[({]N[1IJlTf]H\/VRC[)}]/(NIH\/VRC)/'
'(CDC/DDID/NCIRD/OD)' ';s/[({]C[D0]C\/[D0][D0][I1][D0]\/NCIRD\/[O0][D0][)}]/(CDC\/DDID\/NCIRD\/OD)/'
'(CDC/OD)'........';s/[({]C[D0]C\/[O0][D0][)}]/(CDC\/OD)/'
'(OS/IOS)'........';s/[({][O0]S\/[1IJlTf]0S[)}]/(OS\/IOS)/'
'(OS/ASPR/IO)'....';s/[({][O0]S\/ASPR\/[1IJlTf][O0][)}]/(OS\/ASPR\/IO)/'
;
;

n_cols := 2 ; n_rows := (floor ((gage shape sed_orgAcronym) / 2)) ;
sed_orgAcronym := second cols (n_rows n_cols reshape sed_orgAcronym) ;
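
For comparison, a minimal Python sketch that generates this kind of pattern from the correct acronym, instead of typing each one by hand. It is a generalisation, not what the QNial code does : it applies the same mis-recognition classes to every [I, O, D] in the acronym, which may be too broad for some cases. The names CONFUSABLE, acronym_pattern and fix_acronyms are hypothetical.

```python
import re

# characters that pdftotext frequently confuses, taken from the table above
CONFUSABLE = {
    "I": "[1IJlTf]",
    "O": "[O0]",
    "D": "[D0]",
}


def acronym_pattern(acronym):
    """Build a regex matching mis-recognised variants of eg '(NIH/OD)'."""
    inner = acronym.strip("()")
    parts = [CONFUSABLE.get(ch, re.escape(ch)) for ch in inner]
    return re.compile(r"[({]" + "".join(parts) + r"[)}]")


def fix_acronyms(text, acronyms):
    """Replace every mis-recognised variant by its correct acronym."""
    for acronym in acronyms:
        # a function replacement avoids any escape handling in the acronym
        text = acronym_pattern(acronym).sub(lambda _m, a=acronym: a, text)
    return text


if __name__ == "__main__":
    # hypothetical garbled line, not taken from the emails
    print(fix_acronyms("Doe, Jane {N1H/0D} ; Roe, Richard (NlH/NC1)",
                       ["(NIH/OD)", "(NIH/NCI)"]))
```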

Note that the sequences of periods in the middle are spaces in the QNial program code; the periods are put here for easy html alignment (not the right approach, but fast).

This combines sed expressions :

#] sed_pdftotext reformat, pdftotext

sed_pdftotext := link sed_format1 sed_formatHeads sed_orgAcronym ;


sed_pdftotext is used as the argument "sedExpr" in the QNial operator (function, procedure equivalent) :
pPdf_convertTo_pTxt IS OP pinn pout sedExpr
This is in the form of a host call from QNial, my usual simple way of hybridizing [QNial, bash] :
host link 'pdftotext "' pinn '" "' p_temp '"' ;
host link 'sed "' sedExpr '" "' p_temp '" >"' pout '" ' ;
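
The two host calls above can be mirrored in Python with the standard subprocess module. This is a sketch under assumptions : pdftotext (poppler-utils) and sed must be installed and on the PATH, and the function names are hypothetical. Passing argument lists instead of one command string avoids the quoting of file names that the QNial version handles with embedded double quotes.

```python
import subprocess


def build_commands(p_inn, p_temp, sed_expr):
    """Return the [pdftotext, sed] argument lists for one conversion."""
    return [
        ["pdftotext", p_inn, p_temp],
        ["sed", sed_expr, p_temp],
    ]


def pdf_to_clean_text(p_inn, p_temp, p_out, sed_expr):
    """Convert pdf p_inn to text in p_temp, then write the sed-cleaned text to p_out."""
    pdftotext_cmd, sed_cmd = build_commands(p_inn, p_temp, sed_expr)
    subprocess.run(pdftotext_cmd, check=True)
    with open(p_out, "w", encoding="utf-8") as f_out:
        # sed writes to stdout, which is redirected to p_out
        subprocess.run(sed_cmd, stdout=f_out, check=True)


if __name__ == "__main__":
    # only shows the commands; the file names are placeholders
    for cmd in build_commands("emails.pdf", "temp.txt", "s/[ ]\\+/ /g"):
        print(cmd)
```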



Inversion of 1-level-nested lists of strings

OK, so maybe this is an ancient achievement of computer science, with jillions of interesting optimisations, but it is interesting to me. I have blindly applied brute force approaches in 'dictionaries.ndf' -> pTxt_pDic_extract_pfrag_pFragSubs. This project "inverts" a list of words [merge, sort]ed from the words in the text from the pdf and a standard Linux american-english dictionary (list of words). The challenge is simple on the surface, but large enough [text files, dictionaries] could become a challenge. I suspect this may be "order of N-squared" (O(N^2), where N is the number of data items).
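
To make "inversion" concrete, here is a minimal Python sketch, assuming the nested list can be viewed as a mapping from each word to the list of fragments found in it. That data layout and the function name are assumptions, not the QNial representation. With a hash table, the inversion is a single pass over all (word, fragment) pairs, which avoids searching the whole list for every fragment.

```python
from collections import defaultdict


def invert_nested(word_to_frags):
    """Invert {word: [fragments]} into {fragment: sorted list of words}."""
    inverted = defaultdict(set)
    for word, frags in word_to_frags.items():
        for frag in frags:
            inverted[frag].add(word)
    return {frag: sorted(words) for frag, words in sorted(inverted.items())}


if __name__ == "__main__":
    sup = {"vaccine": ["vac", "cine"], "vaccinate": ["vac", "cinate"]}
    for frag, words in invert_nested(sup).items():
        print(frag, words)
```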

This reminds me of Moore-Penrose matrix inversion, a challenge for large datasets with Guang-Bin Huang's "Extreme Learning Machines" (ELMs).


Software and tools



Future potential work



Warning, waiver

WARNING - the pdftotext process produces many errors, especially space-gaps in words.
13Jun2021 This process requires a fair amount of un[certain, finish]ed "cleaning" to get useable results.
PRIVACY concerns : the "names" of email addresses have been removed (name@affiliation)

Waiver/ Disclaimer :
The reformatting of this document does NOT reflect the [past, current, future] [policy, priority, direction, opinion]s of [this author, employer, work colleague, family, friends, acquaintance]s. This reformat has NOT been approved nor sanctioned at any level by any person or organization, nor has it been checked for errors. The reader is warned that there IS a [warranty, guarantee] as to the accuracy of the reformatting herein : it sucks! The application of this reformat could quite possibly result in severe losses and/or damages to the [author, reader, associate, organization, country, entire human specie]s. The author accepts no responsibility for damages or loss arising from the application of [any, all] part of this reformat, neither for the reader nor third parties. This webPage is one fruit of my madness, and it would be mad to take it for anything else, and if you did, then who is maddest?