Top Secrets of The Efficient Test Data Preparation



You spend a fair amount of time digging into someone’s or your own one-year-old test scenario, bundling different pieces together in memory and trying to understand what is going on here. Once you get the answer, you are still not sure why you needed this or that and whether you needed it at all. At some point, you even start doubting if maintaining such tests is worthwhile. Does it sound familiar to you?

It happens quite often when dealing with integration tests related to the persistence layer. In a database, we usually have a lot of dependencies: references between tables that must be satisfied during data preparation even if the scenario of interest does not care about such details. To satisfy them, people tend to blindly reuse existing code, unintentionally bringing far more detail into the test scenario than is really needed. I call it information redundancy. This article will highlight its negative outcomes, give you directions on how to solve the problem and even provide a drop-in solution for a popular technology stack: Spring + JPA + JUnit/TestNG.

The Value of Integration Testing

Here and further on by “integration testing” we will mean tests involving Spring configuration and persistence layer. Some people think that such tests are evil and should be always avoided in favor of pure (non-integration) unit tests due to their complexity and low speed. The drawbacks of integration testing are undeniable. However, there are a few reasons why you may still find it beneficial:

  • Verifying application configuration. This includes dependency injection, entity mapping, transaction management and so on. The test configuration is always different from production, but if the difference is minimized, a lot of potential problems can still be caught with tests.
  • Catching bugs in third-party libraries. Spring Framework and JPA vendors with their flexibility and variety of usages are good places for hard-to-detect issues to appear and it may happen any time you decide to upgrade your project dependencies.
  • Testing database configuration. This does not happen often because people prefer using in-memory databases for testing, but there are cases when a vendor-specific logic is moved to the persistence layer for performance-related or other reasons and pointing integration tests to a production-like database instance is almost the only way to test the functionality.
  • Making tests easier to write and read. How come?! This is a moot point you wouldn’t expect to see here. Let me clarify: the way I am writing integration tests, which is what I am going to share with you further, is much easier to accomplish than to mock complex dependencies in pure unit tests. You may say that the complexity in tests is a sign that production services need to be refactored. That’s true, however, in practice, people rather tend to continue struggling with mocks and make test scenarios less and less readable anyway. The simplicity of writing integration tests has one drawback though: they crowd out pure unit tests whenever dependency injection and data persistence are involved, leaving the latter just for simple static utility functions, which is not good for performance, of course.

This article is not aimed at either advocating or encouraging the testing of the persistence layer. Your decision may depend on the project type, size, structure, and many other aspects. It rather assumes that you or your team have already decided that writing integration tests does make sense for your project and you are just looking for a better way to do that.

Redundancy in Action

Imagine we have a data structure with three types of entities: User, Post, and Comment. The user may have multiple posts and comments, a post may have multiple comments and all posts and comments must be linked to a user. Further on, I will mention such relationships as parent-to-child, e.g. User is a parent to Post and so on, which has nothing to do with the OOP inheritance, so don’t be confused. The entity mapping may look like this:

@Entity
public class User {
    List<Post> posts;
    List<Comment> comments;
    ...
}

@Entity
public class Post {
    User author;
    List<Comment> comments;
    ...
}

@Entity
public class Comment {
    User author;
    Post post;
    ...
}

There is nothing bad with the mapping. Now let’s assume we have the following test scenarios:

public class UserServiceTest {
    ...
    public void testDelete() {
        // given
        User user = userHelper.createUser();
        // when
        userService.delete(user);
        // then
        ...
    }
}

public class PostServiceTest {
    ...
    public void testDelete() {
        // given
        User user = userHelper.createUser();
        Post post = user.getPosts().get(0);
        // when
        postService.delete(post);
        // then
        ...
    }
}

public class UserHelper {
    ...
    public User createUser() {
        User user = buildUser();
        user.setEmail("hard.code@why.not");
        user.setPosts(singletonList(
            buildPost(user, "some title")
        ));
        ...
        return user;
    }
}

Don’t pay attention to the naming convention or any missing supplementary annotations or classes to integrate Spring with our testing framework. Also, we will not be talking about the transactionality of tests or rollbacks. All this has nothing to do with the problem we are going to discuss. Rather take a deeper look at the createUser helper method and its usages. Does it raise any questions in your mind? Why does the helper method use a real service? Does the latter do anything special that may affect user deletion and thereby needs to be tested in pair? Why is a user always created with a post? Is it because the cascade deletion is tested somewhere or is it just because post-related tests have reused the helper initially designed for the user in such a way? Why is email address hard-coded? And so on…

Too many questions for such a simple example, right? And things will get worse when it’s time to test comments, since there will be even more relationships between entities. In the real world, it is easy to end up with a graph of five or even ten related entities.

In this example, you will not find answers by simply looking at the test scenarios, which, by the way, should ideally act as documentation. You will likely have to dig into the change history, ask the author of the code, look into production services, combine the research results with your own knowledge and finally guess the answer with a bit of uncertainty. This relatively simple example does not show the true level of complexity of a real project; it only demonstrates how much confusion seemingly innocent changes may cause.

Negative Outcomes of Redundancy

Did you notice what the most frequently asked question in the previous section was? I have asked a few people which of the five W questions they think is the most difficult for humanity, and most of them chose “Why” without any hesitation. The mystery of “Why” is that an intention or a reason often lives only in a human mind, and the truth becomes harder to restore as time passes.

When people see something done, time and energy spent, they usually expect there to be a reason behind that. If you find the reason the question is closed. Otherwise, you either continue researching or give up the idea of getting to the truth. Redundancy is always an action without any visible reason and it may give you a false impression that you don’t see the full picture yet. In such situations, it may seem to be safer to leave everything as it is and find minimal changes that would satisfy your current needs. As a result, even more useless code may be introduced increasing the future research efforts.

In the example above we could see two types of redundancy:

  • Unnecessary details. It happens when the exact values are set for certain fields making it unclear how important they are for a given scenario. A hard-coded email is not a disaster, but having a dozen of magic numbers, enums and values of other types may become a significant overhead.
  • Needless data. It means that extra records are created making it unclear how they are related to a given scenario. In addition to confusion, it also reduces performance. Needless data, in its turn, may also include unnecessary details thereby increasing the complexity even more.

Both types of redundancy give you much bigger context than is really needed, concealing the important things and misleading you.

Tests Are Simpler Than You Think

Redundancy combined with code inefficiency (when the same result could have been achieved in fewer steps) makes you think that integration scenarios are much more complex than they really are and that spending effort on their maintenance is inevitable. In fact, most of your test cases are simple enough; they have only been complicated for the sake of the remaining minority. Not sure about the ratio? How many post-related tests will consider the details of their parent user? Maybe a few of them will “care” about a post belonging to a user if you decide to test access control, for example, but not a single test will need to “know” anything about the email address or any other user-specific field. How many user-related tests will consider the concrete value of the email address? At most, a few related to email validation. And in the case of a field that allows any character sequence, you will never care about its exact value at all. At most, you might care about its length if there are size limits to be tested. I hope it is easy to see that, on average, for any given field of a particular entity there will be just a few test scenarios that care about its exact value.

Randomness and Minimalism

So far we have been putting the blame on redundancy and showing how harmful it is. But how can we meet not null or foreign key constraints in a database without setting the exact values before persisting data? Randomness and minimalism are the keys to success! Randomness lets you avoid unnecessary details, and minimalism prevents needless data. Let’s take a look at a few examples.

What would the line below say to you if you knew for sure that the createUser helper function creates a User entity with all fields randomly generated and without any other linked entities?

User user = createUser();

It is easy to guess that such a test needs just a user, any user.
Given the same assumption, it is easy to guess that here we need nothing more than a user with a predefined email:

User user = createUser(u -> u.setEmail("hard.code@why.not"));

These two lines create any user with any post, e.g. to test cascade deletion:

User user = createUser();
createPost(user);

Note that in the example above the post is created explicitly instead of putting any magic code inside the factory method for a user.

Here we create just a post and don’t care about its parent user, while the latter must be randomly generated inside the factory method itself:

Post post = createPost();

Similarly, here we generate any comment and don’t care about its parent post or any other cascaded parents:

Comment comment = createComment();

Quite intuitive and readable, isn’t it? If you make it a rule to create an entity of interest only along with its mandatory parents just to meet foreign key constraints, you will leave no doubts to other guys as to what is really needed for this or that test scenario. Randomness, however, may mean either “any” or “not important”. The difference is very obvious and it should not raise any questions. If you see a generated value appear in the data verification, it means “any”, otherwise it is “not important” and populated just in case there are any not null constraints in the database. The latter, by the way, is the reason why I prefer always setting random values instead of null by default. If you need to test nulls you may set them explicitly whenever you want.
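To make the idea more tangible before moving to Spring and JPA, here is a minimal in-memory sketch of such factory methods in plain Java. It is only an illustration under simplifying assumptions: the SketchUser and SketchPost classes and the MinimalFactorySketch holder are hypothetical stand-ins, nothing is persisted, and the random values come from UUIDs rather than from any particular library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Consumer;

// Hypothetical in-memory stand-ins for the entities discussed above.
class SketchUser {
    String email;
    final List<SketchPost> posts = new ArrayList<>();
}

class SketchPost {
    String title;
    SketchUser author;
}

class MinimalFactorySketch {

    // Random 10-character string; by default its exact value is "not important".
    static String randomString() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 10);
    }

    // Any user: fields are random and no child entities are created.
    static SketchUser createUser() {
        return createUser(u -> {});
    }

    // Any user, except for the details the caller states explicitly.
    static SketchUser createUser(Consumer<SketchUser> customizer) {
        SketchUser user = new SketchUser();
        user.email = randomString() + "@example.com";
        customizer.accept(user);
        return user;
    }

    // Any post: the mandatory parent user is generated behind the scenes.
    static SketchPost createPost() {
        return createPost(createUser());
    }

    // Any post of the given user.
    static SketchPost createPost(SketchUser author) {
        SketchPost post = new SketchPost();
        post.title = randomString();
        post.author = author;
        author.posts.add(post);
        return post;
    }

    public static void main(String[] args) {
        SketchUser fixed = createUser(u -> u.email = "hard.code@why.not");
        SketchPost post = createPost();
        System.out.println(fixed.email + " / " + post.author.email);
    }
}
```

Only the details a scenario cares about appear at the call site; everything else stays random and out of sight.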

Entity Factory Solution

The fact that for any given field of any entity most test scenarios will not care about its exact value brings us to the idea of introducing a factory that would create completely random entities with minimal dependencies initialized by default and providing a mechanism that would allow customizing any entity being created on demand. For Spring + JPA, it appears to be extremely easy to implement. I call the solution entity factory and prefer splitting it into three classes:

  • RandomUtils – a static helper for generating random objects of any types.
  • EntityHelper – a Spring-managed bean that simplifies interaction with the database.
  • EntityFactory – a Spring-managed bean for creating persistent entities.

As you will see soon, EntityFactory is the only project-specific class to be written individually, while the other ones are completely generic and can be copied from one project to another. Cross-project reusability is one of the reasons for such separation. Moreover, RandomUtils can also be reused in pure unit tests.

The concrete implementation of the helper classes may slightly differ based on how you handle bean registration, dependency injection and transaction management, but the difference will be negligible.


RandomUtils is responsible for creating instances of any requested types with all fields randomly populated. Currently, I am using the Easy Random library for that and it completely satisfies all my needs:

import io.github.benas.randombeans.EnhancedRandomBuilder;
import io.github.benas.randombeans.api.EnhancedRandom;
import io.github.benas.randombeans.api.Randomizer;
import io.github.benas.randombeans.randomizers.text.StringRandomizer;

import java.util.Random;

public class RandomUtils {

    private static final Random RANDOM = new Random();

    private static final EnhancedRandom ENHANCED_RANDOM = new EnhancedRandomBuilder()
            // 1) important: make string size fixed and big enough to guarantee uniqueness
            .randomize(String.class, StringRandomizer.aNewStringRandomizer(10, 10, 0))
            // 2) important: make all nested collections empty by default
            .collectionSizeRange(0, 0)
            // 3) nice to have: make integer values non-negative
            //    (the sign bit is masked because Math.abs(MIN_VALUE) overflows and stays negative)
            .randomize(Integer.class, (Randomizer<Integer>) () -> RANDOM.nextInt() & Integer.MAX_VALUE)
            .randomize(Long.class, (Randomizer<Long>) () -> RANDOM.nextLong() & Long.MAX_VALUE)
            .build();

    public static <T> T random(Class<T> type, String... excludedFields) {
        return ENHANCED_RANDOM.nextObject(type, excludedFields);
    }
}

Pay attention to a few customizations that I have made. First of all, string size is fixed to 10 by default. You may choose a different value, but keep two things in mind when making the decision: the size should be big enough to make collisions between generated values very unlikely when there are unique constraints in your database, and small enough to fit most of your size limits. Don’t worry if the size does not fit into one or two columns: you will be able to override the default behavior whenever you want.

The second customization ensures we keep all nested collections empty. They correspond to the optional child entities we do not want to be created by default.

The last customization is rather nice to have and simply makes all integer values positive. The reason is that they often represent counters or identifiers and if you are used to seeing them positive in real data flows having negative values in tests may look confusing and even scary. It does not hurt much making them positive by default even if some fields allow negative values as well. You will be able to customize this behavior anyway.

In one project we had a custom written RandomUtils clever enough to take into account metadata from annotations like @Column(length = 100) and many others. In fact, it was not so difficult to implement and if you like coding you may introduce something like that in your project, but the implementation provided above is more than enough to get started.
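As a rough illustration of what such metadata awareness could look like, here is a reflection-based sketch in plain Java. It is not the implementation from that project: the @MaxLength annotation is a hypothetical stand-in for @Column(length = ...) so that the example needs no JPA dependency, the CountryCode class is made up, and only String fields are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-in for JPA's @Column(length = ...).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface MaxLength {
    int value();
}

// Hypothetical entity with one size-limited field.
class CountryCode {
    @MaxLength(2)
    String code;
    String description;
}

class LengthAwareRandomSketch {

    static final int DEFAULT_LENGTH = 10;

    // Random lowercase string of exactly the requested length.
    static String randomString(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return sb.toString();
    }

    // Populates every String field, capping the length at the annotated limit if there is one.
    static <T> T random(Class<T> type) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T instance = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (field.getType() != String.class) {
                    continue; // other field types are out of scope for this sketch
                }
                MaxLength limit = field.getAnnotation(MaxLength.class);
                int length = limit == null ? DEFAULT_LENGTH : Math.min(DEFAULT_LENGTH, limit.value());
                field.setAccessible(true);
                field.set(instance, randomString(length));
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type, e);
        }
    }

    public static void main(String[] args) {
        CountryCode generated = random(CountryCode.class);
        System.out.println(generated.code + " / " + generated.description);
    }
}
```

A real version would also have to deal with numbers, dates, enums and nested objects, which is exactly the part a library already does for you.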

One more thing to notice is that the factory method allows excluding fields from being generated. This will be needed to keep primary keys initially empty and delegate their generation to the persistence provider.


The main purpose of the EntityHelper bean is to generate and persist an entity of any given type. It should also support a callback function that lets you generate and link parent entities, if any, and customize the entity on demand.

Here is a JPA-based implementation:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.PersistenceContext;
import javax.persistence.metamodel.SingularAttribute;
import java.util.function.Consumer;

@Repository
@Transactional
public class EntityHelper {

    @PersistenceContext
    private EntityManager entityManager;

    @Autowired
    private EntityManagerFactory entityManagerFactory;

    public <T> T create(Class<T> clazz) {
        return create(clazz, e -> {});
    }

    public <T> T create(Class<T> clazz, Consumer<T> callback) {
        // keep the primary key empty so that the persistence provider generates it
        T entity = RandomUtils.random(clazz, getIdFieldName(clazz));
        callback.accept(entity);
        entityManager.persist(entity);
        return entity;
    }

    private String getIdFieldName(Class<?> clazz) {
        return entityManagerFactory.getMetamodel()
                .entity(clazz)
                .getSingularAttributes()
                .stream()
                .filter(SingularAttribute::isId)
                .findFirst()
                .map(SingularAttribute::getName)
                .orElseThrow(() -> new IllegalArgumentException("Cannot get id field name for " + clazz));
    }
}

If you happen to use the native Hibernate API, the only problem you may have is determining id field names dynamically. You may either check the API and try to find a similar solution or reset randomly generated ids back to null manually using callback functions. Or you may simply hardcode the value right in the helper if you are lucky enough to follow a convention and give all ids the same name. Anyway, whatever workaround you choose, it will still be encapsulated inside the helper classes and keep your test scenarios clean of such technical details.
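For the convention-based workaround, a reflection-only sketch might look like the one below. It assumes that every entity declares its identifier in a field literally named id, possibly in a superclass; BaseEntitySketch, TagSketch and IdConventionSketch are hypothetical names introduced only for this illustration.

```java
import java.lang.reflect.Field;

// Hypothetical entities following an "every id field is called id" convention.
class BaseEntitySketch {
    Long id;
}

class TagSketch extends BaseEntitySketch {
    String name;
}

class IdConventionSketch {

    // Assumed project-wide convention, not something JPA or Hibernate dictates.
    static final String ID_FIELD = "id";

    // Walks up the class hierarchy until a field with the conventional name is found.
    static Field findIdField(Class<?> type) {
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(ID_FIELD);
            } catch (NoSuchFieldException e) {
                // not declared here, keep looking in the superclass
            }
        }
        throw new IllegalArgumentException("Cannot get id field for " + type);
    }

    // Resets a randomly generated id so that the persistence provider can assign a real one.
    static void resetId(Object entity) {
        Field field = findIdField(entity.getClass());
        try {
            field.setAccessible(true);
            field.set(entity, null);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("Cannot reset id of " + entity.getClass(), e);
        }
    }

    public static void main(String[] args) {
        TagSketch tag = new TagSketch();
        tag.id = 42L;
        resetId(tag);
        System.out.println(tag.id);
    }
}
```

Note that this only works for ids of reference types such as Long; a primitive id field cannot be set to null.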

You might be wondering why I gave the EntityHelper such a neutral name. The reason is that I am placing a few more generic helper methods there and using it for data verification as well. Here is the entire list of methods I usually have:

<T> T create(Class<T> clazz);
<T> T create(Class<T> clazz, Consumer<T> callback);
void remove(Object entity);
<T> void removeAll(Class<T> clazz);
<T> void removeAll(Class<T> clazz, Map<String, Object> where);
void merge(Object entity);
<T> T find(Class<T> clazz, Object id, String... lazyFieldsToInitialize);
<T> T find(T entity, String... lazyFieldsToInitialize);
<T> List<T> findAll(Class<T> clazz);
<T> List<T> findAll(Class<T> clazz, Map<String, Object> where, String... lazyFieldsToInitialize);

I won’t provide their implementation because it would differ noticeably between JPA and native Hibernate, and it is quite straightforward if you have basic knowledge of your persistence provider API. Also, this is the biggest list of helper methods I’ve ever used, and you don’t necessarily need all of them in your project, especially in the early stages. Later you will see a few examples of how they can be used.


EntityFactory is the bean to be used for data preparation directly in your test scenarios. It should provide the simplest factory methods for all types of entities in your application. Internally it delegates the job to EntityHelper, but if there are any dependencies between entities it should make sure the parent entities are always generated first and linked to the child ones afterward. If needed, the factory may also contain overloaded methods that accept parent entities to link the generated ones to. This example of EntityFactory should clear things up:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class EntityFactory {

    @Autowired
    private EntityHelper entityHelper;

    public User createUser() {
        return entityHelper.create(User.class);
    }

    public Post createPost() {
        return createPost(createUser());
    }

    public Post createPost(User author) {
        return entityHelper.create(Post.class, post -> {
            post.setAuthor(author);
            author.getPosts().add(post);
        });
    }

    public Comment createComment() {
        return createComment(createUser());
    }

    public Comment createComment(User author) {
        return createComment(author, createPost());
    }

    public Comment createComment(Post post) {
        return createComment(createUser(), post);
    }

    public Comment createComment(User author, Post post) {
        return entityHelper.create(Comment.class, comment -> {
            comment.setPost(post);
            post.getComments().add(comment);
            comment.setAuthor(author);
            author.getComments().add(comment);
        });
    }
}

Further on, I will give you more directions on how to keep the solution clean. For now, I will only say that for the given data structure I would not add any more overloaded methods to the factory than listed above.

Entity Factory in Action

Now we have everything to start building compact and readable test scenarios. Let’s see what the test scenarios we discussed at the very beginning could look like:

public class UserServiceTest {

    @Autowired
    private EntityHelper entityHelper;

    @Autowired
    private EntityFactory entityFactory;

    public void testDelete() {
        // given
        User user = entityFactory.createUser();
        // when
        userService.delete(user);
        // then
        assertNull(entityHelper.find(user));
    }

    public void testDelete_CascadeDeletion() {
        // given
        User user = entityFactory.createUser();
        Post post = entityFactory.createPost(user);
        // look how flexible we are building different input combinations!
        entityFactory.createComment(user);
        entityFactory.createComment(post);
        entityFactory.createComment(user, post);
        // when
        userService.delete(user);
        // then
        assertNull(entityHelper.find(user));
        ...
    }
}

public class PostServiceTest {

    @Autowired
    private EntityHelper entityHelper;

    @Autowired
    private EntityFactory entityFactory;

    public void testDelete() {
        // given
        Post post = entityFactory.createPost();
        // when
        postService.delete(post);
        // then
        assertNull(entityHelper.find(post));
    }

    // let’s imagine there was a defect that caused deletion of all user comments,
    // not only the ones tied to a given post
    public void testDelete_OtherUserCommentsNotAffected() {
        // given
        User user = entityFactory.createUser();
        Post post = entityFactory.createPost();
        Comment postComment = entityFactory.createComment(user, post);
        Comment otherComment = entityFactory.createComment(user);
        // when
        postService.delete(post);
        // then
        assertNull(entityHelper.find(postComment));
        assertNotNull(entityHelper.find(otherComment));
    }
}

Here we have even added a few more sophisticated test scenarios. Do they raise the kind of questions we had before? I hope not. The main advantage of such an approach is that you always see what is needed and you know that what you see is really needed. And if there is a field populated just in case, you will not see it explicitly set anywhere!

If you find the example above not impressive enough and the savings in your code not sufficient, I will give you some real numbers. In one project, I took an old revision of the code base where the proposed solution had not been introduced yet, randomly picked a medium-sized test scenario and started refactoring it using the new approach. I went through several helper functions with more than 30 lines of code in total, filtering out useless information, just to become convinced that the whole data preparation could have been replaced with two lines (I have only changed real names without breaking the logic):

Comment comment = entityFactory.createComment();
comment.setText(randomString(COMMENT_SIZE_LIMIT + 1));

I bet many people will guess what is being tested here even without seeing the rest of the code.
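The randomString helper used in those two lines is not shown in this article. A minimal sketch of how it could be written with the standard library only is given below; the alphabet, the limit of 500 characters and the isValidCommentText rule are assumptions made purely for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

class RandomStringSketch {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Hypothetical size limit of the comment text.
    static final int COMMENT_SIZE_LIMIT = 500;

    // Generates a random alphanumeric string of exactly the requested length.
    static String randomString(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("Length must not be negative: " + length);
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(ThreadLocalRandom.current().nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Hypothetical validation rule standing in for the production logic under test.
    static boolean isValidCommentText(String text) {
        return text != null && !text.isEmpty() && text.length() <= COMMENT_SIZE_LIMIT;
    }

    public static void main(String[] args) {
        // the boundary value is accepted, one character more is rejected
        System.out.println(isValidCommentText(randomString(COMMENT_SIZE_LIMIT)));
        System.out.println(isValidCommentText(randomString(COMMENT_SIZE_LIMIT + 1)));
    }
}
```

Expressing the length relative to the limit keeps the intention of the test visible without hard-coding any sample text.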

Another good thing about this approach is that to start using it you don’t have to make any changes in the existing test scenarios. You can either create a brand new test method or refactor an existing one using the entity factory, without having to rewrite everything at once. This makes it possible to try the solution without any risk and perform further refactoring incrementally as time allows.

Protect Simplicity and Localize Complexity

The proposed solution is efficient as long as it is kept simple. If you let other developers do whatever they want after the solution has been introduced, you will likely see them adding more and more scenario-specific code to the EntityFactory just because it is the fastest way of solving their pressing problems. Not everyone is able to foresee all potential side effects and global outcomes. That is why it will require some effort to keep the simple thing that covers 99% of your daily routine untouched and to move deviations for the remaining 1% to different places, preferably as close to the corresponding problems as possible. In short, RandomUtils, EntityHelper, and EntityFactory as presented above are the only things you need to put in a shared place, and they already provide everything needed to make any on-demand customization easy enough. Here are a few tips on how to protect your breadwinner:

  • Do not create overloaded methods with input parameters other than parent entities:
    Post createPost(User user) { ... } // Yes
    Post createPost(String title) { ... } // No!
    Post createPost(Comment comment) { ... } // No!
  • Do not set any exact values, populate only mandatory parent entities:
    public Post createPost() {
        return create(Post.class, post -> post.setAuthor(createUser())); // Yes
    }
    public Post createPost() {
        return create(Post.class, post -> post.setDraft(false)); // No!
    }
  • Do not create optional child entities by default, explicitly create them when needed:
    User user = entityFactory.createUser();
    Post post = entityFactory.createPost(user); // Yes
    public User createUser() {
        return create(User.class, user -> user.setPosts(...)); // No!
    }
  • Don’t try to make generated data valid, let it be as random as technically possible, invalid data will not hurt as long as you keep your tests decoupled enough:
    public User createUser() {
        return create(User.class, user -> user.setLanguage(randomString(2))); // Yes
    }
    public User createUser() {
        return create(User.class, user -> user.setLanguage("en")); // No!
    }

If there is a need to customize the default behaviour of factory methods try this first:

  • Inlining the code in your test method:
    User user = entityHelper.create(User.class, u -> u.setName(name));
  • Moving it to a private function of your test class:
    private Post createPost(String title) {
        Post post = entityFactory.createPost();
        post.setTitle(title);
        return post;
    }
  • Creating an additional flow-specific factory:
    public class UserActivationFactory {
        public User createActivatedUser() {
            // non-trivial and less intuitive data preparation goes here
            ...
        }
    }

The options are sorted by their preferability. I am frequently using the first option and think twice before using the other two. Sometimes I even prefer duplicating a few lines instead of creating a new helper function or class. The latter is considered to be the last resort if multiple tests require the same or similar non-trivial data preparation.

Even though some of the examples of bad coding above may not seem to be too scary, they will become a starting point for polluting shared helpers with non-intuitive code and reverting everything back to where we were at the very beginning. It is worth mentioning that most of these points are not so much restrictions as suggestions to move the code to the right places.

Any Word About Third-Party Solutions?

There are a few popular third-party solutions aimed at helping to prepare and even verify persistent data, such as:

  • Spring Testing Framework itself provides several ways to initialize the database state using SQL scripts.
  • DBUnit provides utilities for database state initialization and verification using externalized data files as a rule.
  • Spring DBUnit provides integration between Spring Testing Framework and DBUnit, wraps up the boilerplate DBUnit code into annotations.
  • DbSetup provides builder-based utilities only for initializing the database state.

I haven’t said any word about them so far just because it would be difficult to talk about their weaknesses without knowing how it could be made better. Now, think how much effort it will cost to write a similar test scenario using any of them:

User user = entityFactory.createUser();
User found = userService.findByEmail(user.getEmail());
assertEquals(found, user);

First of all, you are unlikely to find any other solution that would make the scenario as compact as it is shown here (if you do find one, please share it with me 🙂). Second, none of these solutions can produce a reference copy of a domain-specific entity as an output of data preparation, which is to be expected, although in the majority of cases such a copy would be very handy to compare the action result with during data verification. For example, with SQL scripts or DBUnit data files you would probably have to hard-code the same sample email address in both the input file and the Java code, thereby repeating the information. Even with DbSetup, you would have to either duplicate such information or unduly increase the amount of code by moving shared information to variables. Moreover, DbSetup provides nothing for database verification, which restricts its use and makes you write custom helpers for data verification anyway. Of course, you can try writing some custom code that would adapt third-party interfaces to your project-specific domain model, but how different will it be from the entity factory approach we have discussed before?

Another drawback of datafile-oriented approaches is that none of them provides an easy way to hide unnecessary details. Whether it is an SQL script or an input XML file, we still have to explicitly specify values for mandatory columns. Neither conventions (making certain values mean “any”) nor replacements (passing random values from Java code to data file templates) will help much. Moreover, an attempt to reuse the boilerplate data files sooner or later results in needless data being introduced into almost every test scenario. Overall, I find it extremely hard to maintain externalized data as soon as the project size becomes non-trivial, especially when requirements change and there is a need to make global modifications.

In no way do I want to say that the third-party libraries are worthless. If you don’t use dependency injection or object-relational mapping or simply want to test the data access layer using fully independent utilities, one of the existing libraries could be a good choice. I’ve used many of them when there was no way to apply the entity factory approach and I can’t say it was a bad experience. I just know how much better it could be :).


Redundancy in tests raises unnecessary questions and makes you doubt that the scenario of interest has been fully understood. It prevents people from cleaning their code and making seemingly obvious improvements.

Unlike the lack of details, redundancy itself never fails tests which makes it difficult to track and fix once it has been introduced. Always try to minimize the data used in tests. Randomizing values may help you to populate mandatory fields without bringing unwanted details.

Test scenarios are simpler than they may seem to be. Most of them do not care about the exact value of any given field of any entity. Randomizing the value and overriding the default behavior on demand is all you need to avoid redundancy in the majority of cases.

Check if your technology stack allows creating a factory that would generate random persistent data with minimal dependencies. To keep this solution efficient, you will only have to protect its simplicity and enforce moving any possible customizations close to the corresponding problems, which, in addition, will better document your test scenarios.
