Aem segmentstore vs datastore | Community
robertol6836527
New Participant
March 13, 2024
Solved

Aem segmentstore vs datastore

  • March 13, 2024
  • 4 replies
  • 3143 views

Hi,

In Adobe AEM, which content is saved in the segmentstore and which is saved in the datastore?

What is the difference between the two?

Thank you

4 replies

Tad_Reeves
Accepted solution
New Participant
March 14, 2024

Hey Roberto - 

In overly simplistic terms, the segmentstore contains all of the metadata about each node in the repository, including the content tree itself and all of the textual information about each node. The datastore is used for larger objects that need to be stored in AEM, which could be text data or binaries.

 

There's a base configuration in Jackrabbit Oak that sets the minimum size an object must reach to be stored in the datastore rather than the segmentstore, which I believe still sits at 16 KB. So, let's say you add a page node to AEM and put 2 KB of text in the page: that entire object lives in the segmentstore. But let's say you upload a 100 KB PDF. In that case, the metadata about the PDF (its title, description, JCR properties, location, tags, etc.) is physically stored in the segmentstore, but the binary data of the PDF itself is stored in the datastore, with only pointers in the segmentstore to where to find it.
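
To make the size-based split concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not how Oak implements it: the class `ToyRepository`, its method names, and the use of plain dictionaries are all made up for illustration, and the 16 KB threshold is simply the value recalled in the answer (the effective value depends on how your data store is configured). Whether a payload of exactly the threshold size is inlined or externalized is also an assumption here.

```python
import hashlib

# Assumed threshold taken from the answer above; check your own
# data store configuration for the real value.
MIN_RECORD_LENGTH = 16 * 1024


class ToyRepository:
    """Toy model: one dict plays the segmentstore, another the datastore."""

    def __init__(self, min_record_length=MIN_RECORD_LENGTH):
        self.min_record_length = min_record_length
        self.segment_store = {}  # node path -> node properties
        self.data_store = {}     # blob id -> raw bytes

    def save(self, path, metadata, payload):
        """Store a node, inlining small payloads and externalizing large ones."""
        node = dict(metadata)
        if len(payload) < self.min_record_length:
            # Small payload: kept together with the node's properties.
            node["data"] = payload
        else:
            # Large payload: bytes go to the datastore, the node keeps a pointer.
            blob_id = hashlib.sha256(payload).hexdigest()
            self.data_store[blob_id] = payload
            node["data_ref"] = blob_id
        self.segment_store[path] = node
        return node


repo = ToyRepository()
page = repo.save("/content/site/page", {"jcr:title": "Page"}, b"x" * 2 * 1024)
pdf = repo.save("/content/dam/doc.pdf", {"jcr:title": "Doc"}, b"y" * 100 * 1024)
print(sorted(page))          # property names of the small page node
print(sorted(pdf))           # property names of the PDF node
print(len(repo.data_store))  # number of externalized binaries
```

In this model the 2 KB page ends up entirely in `segment_store`, while the 100 KB PDF leaves only its metadata and a `data_ref` pointer there and puts the bytes in `data_store`.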

 

This is why there are two different maintenance jobs in AEM: one to clean up the segmentstore, the other to clean up the datastore. If, let's say, the reference to that PDF is deleted in AEM, it only gets flagged for deletion in the segmentstore and is still on disk. The revision cleanup job can then reclaim that disk space in the segmentstore when it runs, but the datastore still contains the binary data for that PDF until the datastore cleanup job runs and removes any now-unreferenced objects.
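
The life cycle of that deleted PDF can be sketched with a small Python model. This is an illustration under simplifying assumptions, not Oak's actual algorithm: the segmentstore is modeled as a list of whole-repository snapshots (revisions), revision cleanup simply keeps the newest snapshot, and the datastore cleanup is a plain mark-and-sweep over whatever revisions remain. The class and method names are hypothetical.

```python
class ToyRepo:
    """Toy model of a segmentstore (revisions) plus a datastore (blobs)."""

    def __init__(self):
        self.revisions = [{}]  # each revision maps node path -> blob id
        self.blobs = {}        # blob id -> raw bytes

    def upload(self, path, blob_id, data):
        """Add a binary: bytes go to the datastore, a new revision points at it."""
        self.blobs[blob_id] = data
        head = dict(self.revisions[-1])
        head[path] = blob_id
        self.revisions.append(head)

    def delete(self, path):
        """Delete a node: this only writes a new revision without the path."""
        head = dict(self.revisions[-1])
        head.pop(path, None)
        self.revisions.append(head)

    def revision_cleanup(self):
        """Drop old revisions, keeping only the current head (simplified)."""
        self.revisions = [self.revisions[-1]]

    def datastore_gc(self):
        """Mark blobs referenced by any remaining revision, sweep the rest."""
        marked = {blob for rev in self.revisions for blob in rev.values()}
        removed = [blob for blob in self.blobs if blob not in marked]
        for blob in removed:
            del self.blobs[blob]
        return removed


repo = ToyRepo()
repo.upload("/content/dam/report.pdf", "blob-1", b"%PDF...")
repo.delete("/content/dam/report.pdf")
print(len(repo.revisions), list(repo.blobs))  # old revisions and the blob remain
repo.revision_cleanup()
print(len(repo.revisions), list(repo.blobs))  # revisions trimmed, blob remains
repo.datastore_gc()
print(len(repo.revisions), list(repo.blobs))  # blob finally removed
```

The point of the model is that deleting the node frees nothing by itself: the old revision goes away only with revision cleanup, and the binary goes away only with the datastore cleanup.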

 

Hope that helps! 

New Participant
April 3, 2024

Hello,

Could you also advise on the following?

In practice, should the full revision GC (revision purge) run before the datastore GC, or after?

Let's say my revision cleanup (which maintains the segmentstore) runs daily, with a full GC on Sunday. Should the weekly datastore GC then run on Monday or on Friday?

Does this order affect disk usage and AEM performance during the week?

Tad_Reeves
New Participant
April 3, 2024

You want to run the revision GC (i.e. tar compaction) before the datastore GC. The datastore GC depends on the revision GC to know which blobs are no longer referenced and can be removed. A datastore GC run by itself (i.e. without any revision GC) won't reclaim anything. So, if you run a full revision GC on Saturday, you could then run your datastore GC on Sunday to take advantage of the earlier cleanup.
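
The effect of the ordering can be checked with a tiny Python sketch. It assumes a deliberately simplified model (two revisions, one deleted PDF, revision GC keeps only the head revision, datastore GC keeps only blobs that some remaining revision still references); the function name and job labels are invented for the example and do not correspond to any AEM API.

```python
def run_maintenance(order):
    """Return the blob ids left in the datastore after running jobs in `order`."""
    # Old revision still references the PDF; the head revision no longer does.
    revisions = [{"/content/dam/report.pdf": "blob-1"}, {}]
    blobs = {"blob-1"}

    for job in order:
        if job == "revision_gc":
            # Simplified compaction: keep only the head revision.
            revisions = revisions[-1:]
        elif job == "datastore_gc":
            # Keep only blobs that a remaining revision still points to.
            referenced = set()
            for revision in revisions:
                referenced.update(revision.values())
            blobs = blobs & referenced
        else:
            raise ValueError(f"unknown job: {job}")
    return blobs


# Datastore GC first: the old revision still pins the blob.
print(run_maintenance(["datastore_gc", "revision_gc"]))
# Revision GC first: the blob is unreferenced by the time datastore GC runs.
print(run_maintenance(["revision_gc", "datastore_gc"]))
```

In this model, running the datastore GC first leaves `blob-1` in place until the next datastore GC, whereas running it after the revision GC removes the blob in the same maintenance cycle.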

kautuk_sahni
Employee
March 14, 2024

@robertol6836527 Did you find the suggestions from users helpful? Please let us know if you need more information. Otherwise, please mark the answer as correct for posterity. If you found a solution yourself, please share it with the community.

Kautuk Sahni
somen-sarkar
New Participant
March 14, 2024

Hi @robertol6836527,

In Adobe Experience Manager (AEM), the content repository is divided into two main storage areas: the segment store and the data store. They handle different types of data:

Segment store: This stores the content and properties of your AEM pages. It essentially holds the metadata that describes your content. AEM uses a segment store implementation called TarMK by default, which stores this data in TAR files.

Data store: This stores the binary data associated with your content, such as images, videos, and documents. Data stores are separate from the segment store to improve performance and scalability. AEM can use various data store options, including a default file system data store or external options like Amazon S3.

 

Thanks,

Somen