ArXiv

April 29, 2025

Discovering arXiv (pronounced “archive”) is a significant step in the life of a scientist. At some point, often as a student, you first realize that journal articles describe in greater detail what a textbook might only briefly touch on. And later on, as a researcher, staying up to date in your field means more than reading these journal articles. You need access to ideas as they're being developed. That's where arXiv comes in.

arXiv is an open-access repository of preprints, i.e., research papers shared publicly before peer review. Contrary to what one might think, it wasn't always there: it only started in 1991. arXiv was created by physicist Paul Ginsparg to address the slow and limited distribution of new results, especially in theoretical physics.

Since then, it has grown enormously. Today, arXiv hosts more than two million scholarly articles (from arXiv's About page), with hundreds of new submissions uploaded every day. It covers a wide range of scientific fields and has become a central hub for sharing and accessing research at its earliest stages.

As a person who likes to understand what kind of shoes a shoemaker wears, I was naturally drawn to examine the different articles that are stored on arXiv. That said, a dataset of more than two million scholarly articles seems to be better suited to a statistical study, rather than a detailed study of individual items. So I began investigating how to systematically gather information about these papers for further analysis. I quickly came across an open data repository that hosts the entire arXiv dataset along with its metadata. Not only is the dataset frequently maintained, but it is also free.

Due to the substantial size of the full dataset (over 1.1TB and continually growing), only a metadata file is actually provided on this repository. This file is in JSON format, and each of its entries corresponds to a paper and includes the following key elements:

id: arXiv ID
submitter: Who submitted the paper
authors: Authors of the paper
title: Title of the paper
comments: Additional info, such as number of pages and figures
journal-ref: Information about the journal the paper was published in
doi: Digital Object Identifier (accessible through https://www.doi.org)
abstract: Abstract of the paper
categories: Categories / tags in the arXiv system
versions: A version history (giving, among other things, the submission time)

I chose to focus my analysis on two key fields: the submission time contained in the versions, and the categories. The submission time, as reported in arXiv's Submission Version Availability documentation, corresponds to a "date stamp", recorded as the time the submitter clicks "Submit article" at the end of the submission process. This field is unique and is determined automatically. The categories, on the other hand, are set by the submitter, and there can be as many as needed. By convention, the main one should appear first.

Before proceeding with my analysis, since I already knew which fields I wanted to retain from the original 1.5GB zipped (4.6GB unzipped) file, I looped through all entries and extracted only the necessary data in the desired format. This action decreased the size of the file to 500MB, which made things easier later on.
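A minimal sketch of this reduction step is given below. It rests on several assumptions that are not stated above: the metadata file stores one JSON object per line, each item of versions is an object with a created field holding the timestamp, categories is a single space-separated string, and the first version is the one of interest. The helper names reduce_entry and reduce_file are hypothetical.

```python
import json


def reduce_entry(entry):
    """Keep only the submission time and the list of categories of one paper."""
    # Assumption: "versions" is a list of objects with a "created" timestamp,
    # and the first item is the original submission.
    first_version = entry["versions"][0]
    return {
        "created": first_version["created"],
        # Assumption: "categories" is one space-separated string.
        "categories": entry["categories"].split(),
    }


def reduce_file(src_path, dst_path):
    """Stream the metadata file line by line and write the reduced entries."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            reduced = reduce_entry(json.loads(line))
            dst.write(json.dumps(reduced) + "\n")
```

Reading the file line by line avoids loading the whole 4.6GB into memory at once.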

I first wanted to have a deeper look at these categories. I started by gathering and counting all the possible categories, which to my surprise totaled 176. For a brief moment, I was worried about how to show the results I could get from them, until I realised that the categories I collected also contained subcategories. Fortunately, these subcategories follow a defined pattern, written as [category].[subcategory], so it was relatively easy to extract their category. This additional extraction step left me with a total of 38 categories.
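The extraction of the category from the [category].[subcategory] pattern can be sketched as follows. The function names are hypothetical, and the input is assumed to be one list of tags per paper.

```python
def top_level(tag):
    """Return the part before the first dot, e.g. 'math.AG' -> 'math'."""
    return tag.split(".", 1)[0]


def distinct_categories(tag_lists):
    """Collect the distinct full tags and the distinct top-level categories."""
    full = {tag for tags in tag_lists for tag in tags}
    top = {top_level(tag) for tag in full}
    return full, top


# Toy input: one list of tags per paper.
papers = [["math.AG", "math-ph"], ["hep-th"], ["cs.CL", "cs.AI"]]
full, top = distinct_categories(papers)
print(len(full), len(top))
```

Tags without a dot, such as hep-th, are returned unchanged, so they count as categories in their own right.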

Still a bit high, isn't it?

I decided to have another look at the arXiv website, because according to their About page, there are eight subject areas. That's when I found their Category Taxonomy page. Here, indeed, eight categories are listed: Computer Science, Economics, Electrical Engineering and Systems Science, Mathematics, Physics, Quantitative Biology, Quantitative Finance and Statistics. But as a physicist, I found it too simplistic to group together all the physics categories I saw in the previous step, which span from condensed matter to high energy physics. Then I saw that, unlike the other seven fields, Physics has several categories that are not considered subcategories. And so the final number of categories was 19.

You may ask how we went from 38 to 19, even though in both cases we are looking at categories. After some research, I found that several of the categories I initially collected are no longer considered categories and became, over time, subcategories. Since it may not be easy to follow which categories are kept and which are not, I summarized the situation in the following table:

Category Found	Category(.Subcategory) Now	Name
acc-phys	physics.acc-ph	Accelerator Physics
adap-org	nlin.AO	Adaptation and Self-Organizing Systems
alg-geom	math.AG	Algebraic Geometry
ao-sci	physics.ao-ph	Atmospheric and Oceanic Physics
astro-ph	astro-ph	Astrophysics
atom-ph	physics.atom-ph	Atomic Physics
bayes-an	physics.data-an	Data Analysis, Statistics and Probability
chao-dyn	nlin.CD	Chaotic Dynamics
chem-ph	physics.chem-ph	Chemical Physics
cmp-lg	cs.CL	Computation and Language
comp-gas	nlin.CG	Cellular Automata and Lattice Gases
cond-mat	cond-mat	Condensed Matter
cs	cs	Computer Science
dg-ga	math.DG	Differential Geometry
econ	econ	Economics
eess	eess	Electrical Engineering and Systems Science
funct-an	math.FA	Functional Analysis
gr-qc	gr-qc	General Relativity and Quantum Cosmology
hep-ex	hep-ex	High Energy Physics - Experiment
hep-lat	hep-lat	High Energy Physics - Lattice
hep-ph	hep-ph	High Energy Physics - Phenomenology
hep-th	hep-th	High Energy Physics - Theory
math	math	Mathematics
math-ph	math-ph	Mathematical Physics
mtrl-th	cond-mat.mtrl-sci	Materials Science
nlin	nlin	Nonlinear Sciences
nucl-ex	nucl-ex	Nuclear Experiment
nucl-th	nucl-th	Nuclear Theory
patt-sol	nlin.PS	Pattern Formation and Solitons
physics	physics	Physics
plasm-ph	physics.plasm-ph	Plasma Physics
q-alg	math.QA	Quantum Algebra
q-bio	q-bio	Quantitative Biology
q-fin	econ.GN	Quantitative Finance
quant-ph	quant-ph	Quantum Physics
solv-int	nlin.SI	Exactly Solvable and Integrable Systems
stat	stat	Statistics
supr-con	cond-mat.supr-con	Superconductivity

The final categories are the parts before the dot in the second column (highlighted in red in the original table), and amount to 19. And even though I previously said that this number would be the final one, I preferred to decrease it to 15 by grouping together all the high energy physics fields in one category, and all the nuclear physics ones in another.
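The table and the grouping step can be expressed as two small lookup tables. This is only a sketch: the mapping is copied from the table above, while the group labels "hep" and "nucl" and the function name final_category are hypothetical.

```python
# Legacy category -> current category(.subcategory), copied from the table.
LEGACY_TO_CURRENT = {
    "acc-phys": "physics.acc-ph", "adap-org": "nlin.AO",
    "alg-geom": "math.AG", "ao-sci": "physics.ao-ph",
    "atom-ph": "physics.atom-ph", "bayes-an": "physics.data-an",
    "chao-dyn": "nlin.CD", "chem-ph": "physics.chem-ph",
    "cmp-lg": "cs.CL", "comp-gas": "nlin.CG",
    "dg-ga": "math.DG", "funct-an": "math.FA",
    "mtrl-th": "cond-mat.mtrl-sci", "patt-sol": "nlin.PS",
    "plasm-ph": "physics.plasm-ph", "q-alg": "math.QA",
    "q-fin": "econ.GN", "solv-int": "nlin.SI",
    "supr-con": "cond-mat.supr-con",
}

# Grouping of the high energy and nuclear physics categories.
# The labels "hep" and "nucl" are hypothetical names for the two groups.
GROUPS = {
    "hep-ex": "hep", "hep-lat": "hep", "hep-ph": "hep", "hep-th": "hep",
    "nucl-ex": "nucl", "nucl-th": "nucl",
}


def final_category(tag):
    """Map any tag (legacy or current, with or without subcategory) to one of the 15."""
    top = tag.split(".", 1)[0]                 # drop the subcategory, if any
    current = LEGACY_TO_CURRENT.get(top, top)  # translate legacy names
    top = current.split(".", 1)[0]             # drop the new subcategory
    return GROUPS.get(top, top)                # merge the hep-* and nucl-* fields


print(final_category("alg-geom"), final_category("hep-th"))
```

Applying final_category to the 38 entries of the first column yields 15 distinct values.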

Then I took a look at the submission time. One thing I noticed is that it follows this timestamp convention:

[day of the week], [day] [month] [year] [hour]:[minute]:[second] GMT

GMT stands for Greenwich Mean Time, and is most probably used here as a synonym for Coordinated Universal Time (UTC+00:00). As the location of the submitter is not provided, no local time can be deduced. Nevertheless, two key dates were identified, the earliest and the most recent publication dates, which give the overall time span of the dataset. These are the following: Friday, 25 Apr 1986 at 15:39:49 GMT and Thursday, 10 Apr 2025 at 17:59:59 GMT. The first paper in this dataset is, give or take a few days, 39 years old!
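Parsing such a timestamp and computing the time span can be sketched with the standard library. The format string assumes abbreviated English day and month names (for example "Fri" and "Apr"), which is my reading of the convention above, and %a and %b depend on the locale, so an English or C locale is assumed. The name parse_timestamp is hypothetical.

```python
from datetime import datetime, timezone

# Assumption: abbreviated English day and month names, e.g. "Fri, 25 Apr 1986".
TIMESTAMP_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def parse_timestamp(text):
    """Parse a timestamp string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(text, TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


earliest = parse_timestamp("Fri, 25 Apr 1986 15:39:49 GMT")
latest = parse_timestamp("Thu, 10 Apr 2025 17:59:59 GMT")
span = latest - earliest

# Time span of the dataset in years (using 365.25 days per year).
print(round(span.days / 365.25, 2))
```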

With these two ingredients carefully studied, it was time to let the dataset speak.

My first step was to explore how the number of publications has changed over time, especially because I was interested to know how more than two million scholarly articles ended up on arXiv. And since I had gathered the necessary information, I wanted to split the publications into the 15 categories I identified.

To make the histogram more stable, I chose to define each bin starting at the beginning of an even month, with a width of two months. One last detail is that the histograms are stacked according to their total number of publications. This means that the first category appearing in the legend is the first starting from the bottom, and corresponds to the largest category. The following smaller categories are stacked above, one after the other. This gave the following first stacked histogram:

Now, since the categories sitting on top are not easily readable, I decided to smooth the histogram bins with a Gaussian:

Much better.
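For readers who want to reproduce these two steps, here is a minimal sketch of the binning rule and of a Gaussian smoothing. It assumes that "even month" means February, April, and so on up to December, so that January falls into the bin that starts in December of the previous year. The kernel width sigma and the renormalization at the edges are my own choices, not values taken from the plots above, and the function names are hypothetical.

```python
import math
from datetime import datetime


def bin_start(moment):
    """Return (year, month) of the two-month bin that contains `moment`.

    Bins start on the first day of an even month (Feb, Apr, ..., Dec).
    """
    if moment.month % 2 == 0:
        return (moment.year, moment.month)
    if moment.month == 1:
        return (moment.year - 1, 12)  # January belongs to the December bin
    return (moment.year, moment.month - 1)


def gaussian_smooth(counts, sigma=2.0):
    """Smooth a list of bin counts with a Gaussian kernel (sigma in bins)."""
    radius = math.ceil(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(counts)):
        total = 0.0
        weight = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(counts):  # ignore bins outside the histogram
                total += w * counts[j]
                weight += w
        # Dividing by the weights actually used keeps flat regions flat,
        # including near the two ends of the histogram.
        smoothed.append(total / weight)
    return smoothed


print(bin_start(datetime(1991, 1, 15)))
```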

First, let me comment on the legend. I wanted to explicitly give the names of the categories I found, and grouped them according to their category. In the next plots, I will indicate only the categories, so the labels will slightly change. If you get confused, just remember that the labels are sorted from the largest to the smallest category, and therefore their order remains the same throughout this analysis.

Now the evolution of the categories. It seems that Computer Science (cs) and Mathematics (math) have a very steady and rapid growth compared to the rest of the categories. The next two categories, namely Condensed Matter (cond-mat) and Astrophysics (astro-ph), show a strikingly similar growth that is steady, but less rapid than that of the previous two. High Energy Physics (hep-ex, hep-lat, hep-ph and hep-th) grew rapidly at the beginning and later remained constant, forming a plateau for several years now. The remaining categories show a slow growth, even though some started earlier, like Nuclear Physics (nucl-ex and nucl-th), and others later, like Electrical Engineering and Systems Science (eess). Also, I wasn't aware that since the early 2000s, Economics (econ) had a dedicated category!

I agree that these observations are not immediately obvious in the previous plot, as it takes some practice to spot them. To make things easier, two alternative visualizations are provided below. The first one displays the categories as separate lines (not stacked).

As you can see, the majority of the categories have a number of publications below 3000 per two months. It was therefore considered preferable to display two separate scales to provide a clearer view of the situation.

The second additional plot shows the same categories as a percentage of total publications over time. In that case, several things can be noted. Although the earliest recorded publication dates back to 1986, there appears to be a gap in publications between 1986 and 1989. This may be due to incomplete archiving during arXiv's early years, challenges in retrieving older records, or the possibility that some entries were added retrospectively. Another interesting aspect is observing which category dominated during different periods. In the early 90s, High Energy Physics (hep-ex, hep-lat, hep-ph and hep-th) clearly held the largest share, whereas today, Computer Science (cs) has taken the lead.
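The conversion from counts to shares is a simple normalization per time bin. The sketch below uses made-up toy numbers, not values from the dataset, and the function name shares is hypothetical. Bins without any publication, such as those in the gap mentioned above, are given a share of zero to avoid a division by zero.

```python
def shares(counts_by_category):
    """Convert counts per time bin into percentages of the total per bin.

    `counts_by_category` maps each category to a list of counts,
    one per time bin, with all lists of the same length.
    """
    n_bins = len(next(iter(counts_by_category.values())))
    totals = [
        sum(series[i] for series in counts_by_category.values())
        for i in range(n_bins)
    ]
    return {
        category: [
            100.0 * count / total if total else 0.0  # empty bin -> 0 %
            for count, total in zip(series, totals)
        ]
        for category, series in counts_by_category.items()
    }


# Toy numbers only, to illustrate the normalization.
toy = {"hep": [8, 2, 0], "cs": [2, 8, 0]}
print(shares(toy))
```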

My next question was whether submissions are more frequent on certain days of the week. To answer it, I needed to look at the distribution of publications per day. Again, I wanted to know if a certain category was more represented than another, so I also divided the publications into the different categories I mentioned earlier. And I sorted the categories by their size, going from the largest at the bottom to the smallest on top. This gave the following result:

I have to say that this pattern was partially expected. Indeed, even if the submission time is given in GMT, the "5 days on, 2 days off" model is widespread (though not universal), so fewer publications were expected on the days off. But is there a specific hour that is more represented than another during these days? Since nothing striking or non-redundant could be said about the different categories, this question motivated the next plot:

Visually interesting? Certainly, yes. Easy to read? Not really.

Even though I put a lot of effort into making the hours readable, with the choice of a specific color map, I had to admit it is not really easy to understand what is going on, or to derive any trend from it. So I changed strategy. Instead of having a stacked histogram over the days of the week, I wanted a distinct histogram for each day, representing the number of publications per 20 minutes. Once the histograms were smoothed with a Gaussian, this gave the following:

Way more readable, right?
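The counting behind such a plot can be sketched as follows: each submission time is reduced to a day of the week and a 20-minute slot within the day. The function names are hypothetical, and the smoothing step is left out here.

```python
from datetime import datetime

SLOT_MINUTES = 20
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 72 slots of 20 minutes


def weekday_slot(moment):
    """Return (day of the week, 20-minute slot index), with Monday as day 0."""
    slot = (moment.hour * 60 + moment.minute) // SLOT_MINUTES
    return moment.weekday(), slot


def weekly_profile(moments):
    """Count submissions per 20-minute slot, separately for each weekday."""
    counts = [[0] * SLOTS_PER_DAY for _ in range(7)]
    for moment in moments:
        day, slot = weekday_slot(moment)
        counts[day][slot] += 1
    return counts


# The two boundary dates of the dataset, used as toy input.
profile = weekly_profile([
    datetime(1986, 4, 25, 15, 39, 49),  # a Friday
    datetime(2025, 4, 10, 17, 59, 59),  # a Thursday
])
```

Each of the seven rows of the result corresponds to one histogram, that is, one day of the week.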

To make the x-axis more accessible, I added two additional time offsets from UTC, chosen because many people live in the corresponding time zones. Still, even though the trend we see is very interesting, it is not easy to interpret.

There are definitely five peaks, at around 2:45, 9:30, 15:30, 18:15 and 20:15 GMT. A first instinct would be to check whether these hours correspond to the end of the working day, when people hurry to submit their paper before going home. But without the submitter's location, it is not possible to draw that conclusion. Actually, no conclusion about these peaks can be drawn. Looking up where the largest scientific universities are located and trying to infer something from that won't help either. But one thing that can be said with more certainty is that papers are also submitted during the two days off.

What about the categories? Will they all show the same trend? I'll let you have a look at the next and last plot, showing a distinct histogram per category, under the same conditions as previously described:

Interestingly enough, not all categories show the same number of peaks. For example, Mathematics (math) shows three main peaks, with a tiny bump on the left side of the last one. Astrophysics (astro-ph), on the other hand, is much less smooth between 17:00 and 21:15 than the other categories. But again, more clues would be needed to interpret these curves fully.

The conclusion of this blog post remains positive. Although no firm conclusion could be drawn from the plots, it was nonetheless possible to produce them with only two ingredients: submission time and categories. Obviously, more details would lead to a more interesting discussion, but I already find it impressive to be able to obtain results from a very simple (yet voluminous) dataset.