<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Repository Collection: null</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/58909</link>
    <description />
    <pubDate>Tue, 21 Apr 2026 13:27:18 GMT</pubDate>
    <dc:date>2026-04-21T13:27:18Z</dc:date>
    <item>
      <title>Investigating Sparsity of Self-Attention</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/60039</link>
      <description>Title: Investigating Sparsity of Self-Attention
Author(s): Martin Garaj; Kisub Kim; Alexis Stockinger
Abstract: Understanding the sparsity patterns of the self-attention mechanism in modern Large Language Models (LLMs) has become increasingly important for improving computational efficiency. Motivated by empirical observations, numerous algorithms assume specific sparsity structures within self-attention. In this work, we rigorously examine five common conjectures about self-attention sparsity frequently addressed in recent literature: (1) attention width decreases through network depth, (2) attention heads form distinct behavioral clusters, (3) recent tokens receive high attention, (4) the first token maintains consistent focus, and (5) semantically important tokens persistently attract attention. Our analysis uses over 4 million attention weight vectors from Llama3-8B collected on the long-context benchmark LongBench to achieve statistically significant results. Our findings strongly support the conjectures regarding recent token attention (3) and first-token focus (4). We find partial support for head clustering (2) and the Persistence of Attention Hypothesis (5), suggesting these phenomena exist but with important qualifications. Regarding attention width (1), our analysis reveals a more nuanced pattern than commonly assumed, with attention width peaking in middle layers rather than decreasing monotonically with depth. These insights suggest that effective sparse attention algorithms should preserve broader attention patterns in middle layers while allowing more targeted pruning elsewhere, offering evidence-based guidance for more efficient attention mechanism design. © 2025 The Authors.</description>
      <pubDate>Wed, 29 Oct 2025 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/60039</guid>
      <dc:date>2025-10-29T15:00:00Z</dc:date>
    </item>
  </channel>
</rss>

