Disk Usage Report much smaller than disk space consumed | Community
June 27, 2019
Solved

Disk Usage Report much smaller than disk space consumed

  • June 27, 2019
  • 25 replies
  • 11820 views

We have an AEM 6.2 instance with Hotfix 17578 (cq-6.2.0-hotfix-17578) installed so we are on Oak 1.4.17 and CFP18.

We run garbage collection, version cleanup, and workflow cleanup daily. We have a separate datastore and segment store. We have performed offline compaction, but that only applies to the segment store.

The disk usage report in AEM (/etc/reports/diskusage.html) says we are using ~18 GB of data. However, our datastore has grown to ~170 GB. I cannot find a way to reduce this or to determine where the growth is coming from.
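To put numbers on the mismatch, the directories can be measured directly on disk and compared with the report. The following is a minimal sketch, not an Adobe tool: the function name `dir_sizes_kb` is hypothetical, and the paths in the usage note assume an installation under /content/aem, which may differ on your system.

```shell
#!/bin/sh
# Print "<size in KB><TAB><path>" for each directory passed as an argument.
# Directories that do not exist are reported as "missing" instead of failing.
dir_sizes_kb() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # du -sk: one summary line per directory, in 1 KB units
            size=$(du -sk "$dir" | awk '{print $1}')
            printf '%s\t%s\n' "$size" "$dir"
        else
            printf 'missing\t%s\n' "$dir"
        fi
    done
}
```

For example, `dir_sizes_kb /content/aem/datastore /content/aem/crx-quickstart/repository/segmentstore` would show how much of the total sits in the datastore and how much in the segment store.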

Following this page (Analyze unusual repository growth), I can see we even have a few files between 1 and 6 GB, but there is definitely no file that big in our DAM or packages. The entire DAM, according to the usage report, is ~6 GB.
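A listing of oversized blobs like the one described above can be produced with `find`. This is a sketch under assumptions: the function name `large_files` is hypothetical, and the threshold is given in bytes so that only the POSIX form of `-size` is needed.

```shell
#!/bin/sh
# List regular files under a directory that are larger than a byte threshold.
large_files() {
    dir=$1
    min_bytes=$2
    # -size +Nc selects files whose size is greater than N bytes
    find "$dir" -type f -size +"${min_bytes}c" -exec ls -l {} +
}
```

For example, `large_files /content/aem/datastore 104857600` would list every file over 100 MB (the datastore path is an assumption and may differ on your system).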

What can I do to reduce the size of our datastore?  What is causing this problem?

Best answer by

Here is what I have done that seems to have resolved the issue. Please let me know if there is a problem I am not seeing; otherwise, it seems to have worked beautifully.

java -jar crx2oak-1.8.6-all-in-one.jar segment-old:/content/aem/crx-quickstart/repository segment-old:/content/backup/ --include-path=/ --src-datastore=/content/aem/datastore --datastore=/content/backup/datastore/

By running this command as if I were upgrading, but using "segment-old" for both source and target, I was able to create a repository that is ~11 GB, compared to the previous ~170 GB, and everything seems to work afterward.

My only concern is that this page (InvalidFileStoreVersionException migrating from older version to 6.3 using CRX2Oak) mentions "segment-old" only for the source repository. However, using it for the destination repository does not seem to cause a problem.

25 replies

June 27, 2019

Wouldn't a repository consistency check, which we also run daily, catch these corruptions?  I will try this and follow up.

June 27, 2019

All files over 100 MB are zips.  I have inspected one of the files over 6 GB.  They look like packages that were backups, but these packages do not exist anymore.  I have checked all packages from the Package Explorer multiple times to confirm.  Is there something I can do about these files in the meantime?  Why would they not be garbage collected?

This would help clean up at least a few GB, but the question about the remaining 150+ GB still stands. :-/

Employee
June 27, 2019

One method you can try:

- Clone your current AEM server to a separate server.

- Delete crx-quickstart/repository/index folder.

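- Run offline compaction, but use the rm-all option instead of rm-unreferenced when cleaning up checkpoints. This removes all checkpoints rather than only the unreferenced ones, so the indexes will be rebuilt from scratch.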

- Start your server. Upon server startup, all indexes will be rebuilt and it will be a slow startup. Wait until the server is fully up.

- Run datastore GC.

- Run the disk usage report and compare its result with the actual disk size.

If you have corrupt index data, then the above should resolve it. Corrupt index data might give you incorrect results from the disk usage report. It could also cause datastore GC to allow content to persist when it should not.
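The offline part of the steps above can be sketched as a dry run that only prints the commands, so they can be reviewed before anything is executed on the clone. The assumptions here: the function name `print_reindex_plan` is hypothetical, the jar name `oak-run.jar` is a placeholder for whichever oak-run build matches your Oak version, and the `checkpoints ... rm-all` and `compact` invocations follow the usual oak-run offline compaction procedure, which you should verify against the documentation for your release.

```shell
#!/bin/sh
# Print (do not run) the offline steps for a cloned, stopped AEM instance.
# $1: repository directory, e.g. crx-quickstart/repository
# $2: path to the oak-run jar (placeholder default, adjust to your version)
print_reindex_plan() {
    repo=$1
    jar=${2:-oak-run.jar}
    # Step 1: drop the local index folder
    echo "rm -rf $repo/index"
    # Step 2: remove all checkpoints instead of only unreferenced ones
    echo "java -jar $jar checkpoints $repo/segmentstore rm-all"
    # Step 3: compact the segment store
    echo "java -jar $jar compact $repo/segmentstore"
}
```

Running `print_reindex_plan crx-quickstart/repository` prints three commands. The remaining steps (starting the server, datastore GC, and the disk usage report) happen on the running instance.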

June 27, 2019

It is not a shared datastore. How can I tell whether the TarMK datastore is embedded or external? I believe it is a TarMK datastore, since the run modes include crx3tar.

Employee
June 27, 2019

If you want to figure out what those 1-6GB files are, you can run the Linux/Unix command "file" on those files, and it will identify what type of file it is.

For example, when I run it on a blob in my datastore, I get the following, which indicates it is a JPEG:

$ file 1677c4fff0d5c7b5f7788edcb549639d60d5c44a4aff101dcd830a7b16e653a0

1677c4fff0d5c7b5f7788edcb549639d60d5c44a4aff101dcd830a7b16e653a0: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [TIFF image data, big-endian, direntries=12, height=2848, bps=0, PhotometricIntepretation=RGB, orientation=upper-left, width=4288], baseline, precision 8, 1626x1080, frames 3

I then copy that blob and rename it to image.jpg and open it, and I can see which image it is. This might give a clue as to where that image is coming from.
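When there are several suspicious blobs, the same check can be repeated in a loop. This is a small sketch that assumes the `file` utility is installed; the function name `identify_blobs` is hypothetical.

```shell
#!/bin/sh
# Print "<path><TAB><detected type>" for each file passed as an argument.
identify_blobs() {
    for blob in "$@"; do
        # file -b prints only the detected type, without repeating the name
        printf '%s\t%s\n' "$blob" "$(file -b "$blob")"
    done
}
```

Pass it the paths of the large blobs you found, for example `identify_blobs path/to/blob1 path/to/blob2` (placeholder paths).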

June 27, 2019

I had seen that post and the linked maintenance document before posting, and I have performed all of those operations, including audit log purge (nothing older than five days).

antoniom5495929
New Participant
June 27, 2019

Hi,

Thanks for the info.

Are you using AEM with TarMK and an external or embedded datastore?

Are you using a shared datastore?

Let us know.

Thanks,

Antonio

June 27, 2019

Yes. I performed offline compaction and then tried garbage collection in various ways. I turned off the application, ran compaction, turned it back on, and ran garbage collection, including with the following command. The log output below should make it clear that I did.

curl --silent -u username:password -X POST --data markOnly=false http://localhost:4502/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean

27.06.2019 18:10:36.159 *INFO* [qtp1670602695-1929] log.access 127.0.0.1 - admin 27/Jun/2019:18:10:36 +0400 "POST /system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean HTTP/1.1" 200 201 "nt" "curl/7.29.0"

27.06.2019 18:10:36.160 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Starting Blob garbage collection with markOnly [false]

27.06.2019 18:10:36.207 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (2048) blob references

27.06.2019 18:10:36.248 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (4096) blob references

27.06.2019 18:10:36.266 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (6144) blob references

27.06.2019 18:10:36.294 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (8192) blob references

27.06.2019 18:10:36.314 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (10240) blob references

27.06.2019 18:10:36.330 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (12288) blob references

27.06.2019 18:10:36.346 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (14336) blob references

27.06.2019 18:10:36.362 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (16384) blob references

27.06.2019 18:10:36.377 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (18432) blob references

27.06.2019 18:10:36.394 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (20480) blob references

27.06.2019 18:10:36.410 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (22528) blob references

27.06.2019 18:10:36.425 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (24576) blob references

27.06.2019 18:10:36.441 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (26624) blob references

27.06.2019 18:10:36.455 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (28672) blob references

27.06.2019 18:10:36.473 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (30720) blob references

27.06.2019 18:10:36.487 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (32768) blob references

27.06.2019 18:10:36.525 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (34816) blob references

27.06.2019 18:10:36.551 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Collected (36864) blob references

27.06.2019 18:10:36.598 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Number of valid blob references marked under mark phase of Blob garbage collection [37965]

27.06.2019 18:10:36.722 *ERROR* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Not all repositories have marked references available : [7e195675-c082-4cdf-8ec2-813ad8194891, 56af03b7-829d-445f-813a-e75681f86188]

27.06.2019 18:10:36.722 *INFO* [sling-oak-observation-75] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Blob garbage collection completed in 562.6 ms. Number of blobs deleted [0] with max modification time of [2019-06-26 18:10:36.160]
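The two lines that matter in this output are the ERROR from MarkSweepGarbageCollector and the final count of deleted blobs, which is 0 here. A short sketch for pulling both out of a log file follows. The function name `gc_summary` is hypothetical, and the patterns assume the message wording shown above.

```shell
#!/bin/sh
# Summarize a blob GC run from a log file: count the collector's ERROR lines
# and report the number from the last "Number of blobs deleted [N]" message.
gc_summary() {
    log=$1
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    errors=$(grep -c 'ERROR.*MarkSweepGarbageCollector' "$log" || true)
    deleted=$(sed -n 's/.*Number of blobs deleted \[\([0-9]*\)\].*/\1/p' "$log" | tail -n 1)
    printf 'errors=%s deleted=%s\n' "$errors" "${deleted:-unknown}"
}
```

Run against the log excerpt above, it would report one error and zero deleted blobs, which is a quick way to confirm whether a given GC run actually removed anything.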

Employee
June 27, 2019

Adding to what Antonio mentioned, there are other maintenance tasks, such as audit log purge. They are listed in the earlier update I made.

antoniom5495929
New Participant
June 27, 2019

Hi michaelh28626156,

I can confirm that we read your post. That is why I added more detail about the order of execution.

Are you sure you are running your datastore garbage collection AFTER the compaction? Otherwise, your GC is useless.

Let us know.

Thanks,

Antonio