Circumvent arXiv LaTeX Detection

Sep 9, 2023

If anyone from arXiv is reading this, I implore you: we are both researchers who have better things to spend our time on than playing cat-and-mouse games for hours, trying to figure out useless detection and circumvention methods. An author should have the prerogative to present their final work instead of being forced to feed every draft and script into an automatic machine that may or may not work. Even though you claim it is for the sake of archiving, please remember: by the time the PDF format is no longer relevant or viewable, TeX will likely have faded out too. Hence, the current requirements are only good for facilitating the collection of large datasets for AI training, to which every author should have the right to consent or object. Please stop imposing your perspective on every author. Is that too much to ask?

So arXiv currently prevents people from uploading PDFs compiled from LaTeX, for the sake of archiving (see also: How to upload LaTeX-generated pdf paper to arXiv without LaTeX sources). I did some experiments, and it seems that the detector checks the metadata and the embedded fonts of the PDF. Fortunately, both are easy to obfuscate to a certain degree:

Hide the Metadata

There are countless ways to erase the metadata of a file. I borrowed this one from here.

pdftk $PDFFILE dump_data | \
  sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
  pdftk $PDFFILE update_info - output clean-$PDFFILE
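For anyone who finds the sed expression hard to read, here is a minimal Python sketch of the same text transformation. It only rewrites the text that pdftk's dump_data prints; it does not touch the PDF itself. The function name blank_info_values and the sample dump are my own illustration, not part of pdftk.

```python
import re

# Matches "InfoValue:" followed by a space or tab and the rest of the line.
# [ \t] is used instead of \s so the match can never spill onto the next line.
INFO_VALUE = re.compile(r"(InfoValue:)[ \t].*")


def blank_info_values(dump):
    """Blank every InfoValue in pdftk dump_data text (hypothetical helper).

    Mirrors the sed expression s/\\(InfoValue:\\)\\s.*/\\1\\ /g:
    the key lines stay, the values are replaced by a single space.
    """
    return INFO_VALUE.sub(r"\1 ", dump)


# Illustrative sample in the dump_data layout; the values are made up.
sample = (
    "InfoBegin\n"
    "InfoKey: Creator\n"
    "InfoValue: LaTeX with hyperref\n"
    "NumberOfPages: 3\n"
)
print(blank_info_values(sample))
```

The cleaned text would then be fed back through pdftk update_info exactly as in the pipeline above.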

Obfuscate the Font

LaTeX currently uses CMR (Computer Modern Roman) as its default font, and the detector checks for it. You can run pdffonts $PDFFILE to see which fonts your PDF embeds.
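If you want to script that check, the following is a rough sketch that parses the text printed by pdffonts and lists the names that look like Computer Modern. What the arXiv detector really matches is unknown to me; the CM_NAME pattern (names such as CMR10 or CMBX12) and the helper names are my own assumptions.

```python
import re

# pdffonts prints subset fonts as "ABCDEF+Name"; strip that six-letter tag.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")
# Assumption: classic Computer Modern names look like CMR10, CMBX12, CMMI10.
CM_NAME = re.compile(r"^CM[A-Z]+\d+$")


def font_names(pdffonts_output):
    """Return the font names from pdffonts output, without subset prefixes."""
    names = []
    # The first two lines are the column header and the dashed separator.
    for line in pdffonts_output.splitlines()[2:]:
        fields = line.split()
        if fields:
            names.append(SUBSET_PREFIX.sub("", fields[0]))
    return names


def computer_modern_fonts(pdffonts_output):
    """Return only the names that match the assumed Computer Modern pattern."""
    return [name for name in font_names(pdffonts_output) if CM_NAME.match(name)]
```

To run it on a real file, you could capture the tool's output with subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout and pass that string to computer_modern_fonts. An empty result only means no name matched my guessed pattern, not that the detector will agree.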

Fortunately, there are many fonts that look and behave like CMR. One way is to use the newtx font:

\RequirePackage[T1]{fontenc}
\documentclass[conference, compsoc]{IEEEtran}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{newtx}

Alternatively, you may try lmodern or any other font that you find satisfying.

Swap out the File

Another way to bypass the detector is to upload a shell LaTeX project that contains the original PDF. Unfortunately, directly including a PDF has been detected and banned (what cat-and-mouse game are we playing now, huh?). But this method from here still works:

Note: you need to install and run pax first to get internal links in the page.
% cSpell:disable
\documentclass[letter]{report}

\usepackage{graphicx}
\usepackage{pgffor}
\usepackage[margin=0in]{geometry} % remove all margins
\usepackage{pax} % process .pax file to bring back internal links

\newcounter{pdfpages}
\newcommand*{\getpdfpages}[1]{%
  \begingroup
  \sbox0{%
    \includegraphics{#1}%%
    \setcounter{pdfpages}{\pdflastximagepages}%
  }%
  \endgroup
}

\usepackage[hidelinks]{hyperref} % remove ugly link borders

% Add metadata to resulting file.
\hypersetup{
  pdfauthor = {Dmytro Bogatov <dmytro@bu.edu>},
  pdftitle = {Secure and Efficient Query Processing in Outsourced Databases},
  pdfsubject = {Doctoral Dissertation},
  pdfkeywords = {OPE, ORE, Range Query Protocols, Epsolute, kNN},
  pdfcreator = {LaTeX with hyperref package},
  pdfproducer = {dvips + ps2pdf}
}

% Adapt to your case.
\usepackage[
  backend=biber,
  style=alphabetic,
  giveninits=false,
  sorting=nyt,
  maxbibnames=1000,
  maxalphanames=4
]{biblatex}
\bibliography{bibfile}

\begin{document}

  % This is to remove the first blank page. Feel free to improve.
  \vspace*{-4ex}

  \getpdfpages{file}
  \foreach\x in {1,...,\value{pdfpages}} { % chktex 11
    \begin{center}
      \includegraphics[width=\paperwidth,keepaspectratio,page=\x]{file}
    \end{center}
  }

  % This is to generate all the citations from your bib-file to *.blg.
  \nocite{*}

\end{document}

Future Prospects

I think it is possible that someone from arXiv will try to improve the detector. Nevertheless, I feel it is ultimately untenable to try to catch all LaTeX PDFs (worst case, I can open MS Word and typeset the same thing). I also think it is morally wrong to do so, as an author should have the right to present the work in the way they wish (PDF and TeX make no difference to any reader today, other than being fed to AI). Given that arXiv is the only influential self-publishing platform in some fields (so there is no suitable alternative, and it cannot be claimed that 'you can always use another platform'), it is sad that they have decided to force this wrong policy.