Lucinity's Input into the FCA's Synthetic Data Feedback Statement
In February 2023, the Financial Conduct Authority (FCA) published a Synthetic Data Feedback Statement that Lucinity contributed to. Read further for a summary of Lucinity's perspective on the challenges and opportunities of using synthetic data, specifically within financial crime.
In March 2022, the FCA released a Call for Input to further its understanding of the market maturity of synthetic data within financial services and its potential to expand data sharing between firms, regulators, and other public bodies. The Call for Input sought feedback on the broader challenges of accessing high-quality data for innovation in financial services. Lucinity's responses to the FCA's questions are summarized below.
Q1: How important do you think access to data is for innovation within financial services? What else do you view as significant barriers to innovation?
Digitization and progress within the financial services industry are, as in any other industry, fueled by advances in technology. In this case, progress is driven predominantly by automation and intelligence, both of which depend on data. Innovation is the process of improving existing processes, which requires the ability to test outcomes, measure improvements, and track metrics, all of which are impossible without the underlying data and the ability to conduct experiments. Looking a step further at machine learning and deep learning applications in financial services, of which there are a plethora, data is the alpha and omega, and innovation is significantly hampered by its absence.
Another barrier to innovation in the financial services sector is undoubtedly the lack of regulatory sandbox environments. Initiatives such as the FCA sandbox are very welcome among forward-thinking companies, as they enable the testing of highly sophisticated systems within a highly regulated environment.
Lastly, the fuzziness around data privacy and data sharing, both in the regulations themselves and in the industry's risk-averse approach to them, further burdens true invention and advancement, especially in areas such as financial crime compliance.
Q2: Do you agree that it is challenging to access high-quality financial data sets? If so, specifically, what challenges do you face?
It is challenging to access high-quality financial data sets. There are plenty of 'free' and 'available' data sets online, but their integrity is questionable, and at best their usability is limited to testing for model generalizability.
Buying data sets is increasingly expensive, and the risk-averse stance of most financial institutions makes data partnerships for the sake of innovation staggeringly hard. Moreover, financial institutions themselves often face technological challenges in creating, maintaining, and using financial data sets, which makes any kind of data collaboration, sharing, or extraction very difficult.
Oftentimes, what is needed from a business and innovation standpoint is the ability to work with a group of financial institutions that want to participate in a safe sandbox environment. This would entail designing the desired data set jointly with the innovator for the specific use case, standardizing it for the innovator and the financial institutions, collecting that data, and providing it in a technologically sound framework, preferably via APIs. Alternatively, the same process can happen without sharing real data: a data schema can be drawn up collectively, and the distributions of the underlying data can be shared so that synthetic data can be created, as sketched below.
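To illustrate that alternative, here is a minimal sketch of schema-plus-distribution sharing. The field names, distributions, and parameters are purely hypothetical stand-ins for what the institutions and an innovator might agree on; no real customer data is involved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, jointly agreed schema: each field declares a distribution and its parameters.
schema = {
    "amount_gbp": ("lognormal", {"mean": 4.0, "sigma": 1.2}),
    "tx_per_day": ("poisson", {"lam": 3.5}),
    "is_cross_border": ("bernoulli", {"p": 0.08}),
}

def sample_field(dist, params, n):
    """Sample n values for one field from its declared distribution."""
    if dist == "lognormal":
        return rng.lognormal(params["mean"], params["sigma"], n)
    if dist == "poisson":
        return rng.poisson(params["lam"], n)
    if dist == "bernoulli":
        return rng.random(n) < params["p"]
    raise ValueError(f"unknown distribution: {dist}")

def generate(schema, n):
    """Generate n synthetic records, field by field, from the shared schema."""
    return {field: sample_field(dist, params, n) for field, (dist, params) in schema.items()}

synthetic = generate(schema, n=10_000)
print({field: values[:3] for field, values in synthetic.items()})
```

In practice, the agreed schema would also need to capture correlations between fields and between actors; independent per-field sampling like this is only the simplest starting point.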
Q3: Do you agree with the high-level benefits for synthetic data? Are there any other benefits for synthetic data for your organization, both now and in the future?
Accurate synthetic data solves a lot of innovation and business challenges. Being able to train and test automation, machine learning models, and AI against (realistic!) synthetic data allows companies like Lucinity not only to ensure AML compliance coverage through our AI algorithms but also to ensure their generalizability and further development. Widespread access to synthetic data would also allow the entire ecosystem to innovate much more rapidly and create further network benefits.
For Lucinity specifically, the benefits of synthetic data currently lie predominantly in the machine learning and AI side of our software. We aim to further develop, test, and improve our AML detection algorithms with it, and in the future synthetic data can also be leveraged in our continuous, event-driven risk rating modules to model scenarios.
Q4: Does your organization currently generate, use, purchase, or otherwise process synthetic data?
At Lucinity, we currently use synthetic data for a variety of purposes:
- To ensure the quality of our entire product suite by emulating real data for internal testing purposes
- To provide potential customers with a realistic experience when visiting our open sandbox environment at https://www.lucinity.com/signup
- To further test our AI AML detection algorithms and their generalizability
- To further test our continuous AI risk rating models and their generalizability
- To model shifts and changes in the economy and their potential impact on our AI models
Q5: If your organization generates synthetic data, please describe at a high level the techniques used. Why have you chosen to use this approach?
We are exploring various approaches to generate synthetic data, including the following:
- Auto-encoder approaches to sample and simulate data from the same underlying distributions
- Distribution sampling
- Generative models
These are reliable, robust, and fitting approaches for generating data as complex as financial data so that it can be used in further processing and analysis to train and test machine learning models.
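As a rough illustration of the auto-encoder idea above, and not of Lucinity's production pipeline, the sketch below trains a small autoencoder on stand-in tabular transaction features, fits a Gaussian to the learned latent codes, and decodes fresh latent samples into synthetic rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "real" data: 1,000 transactions with 4 numeric features (amounts, counts, ...).
real = torch.randn(1000, 4) * torch.tensor([50.0, 3.0, 1.0, 10.0]) + torch.tensor([200.0, 5.0, 0.0, 30.0])

# Normalize features so the network trains on comparable scales.
mean, std = real.mean(0), real.std(0)
x = (real - mean) / std

encoder = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
decoder = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 4))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(200):  # short demo training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x)), x)
    loss.backward()
    opt.step()

# Fit a Gaussian to the latent codes, sample from it, and decode back to feature space.
with torch.no_grad():
    z = encoder(x)
    z_mean, z_std = z.mean(0), z.std(0)
    synthetic = decoder(torch.randn(500, 2) * z_std + z_mean) * std + mean

print(synthetic[:3])
```

Generative models such as variational autoencoders or GANs take the same latent-sampling idea further by learning the latent distribution jointly with the reconstruction, which tends to produce more faithful synthetic data at the cost of a more involved training setup.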
Q6: What do you see as the difficulties and barriers for firms in creating high-utility, privacy-secure synthetic data?
One challenge is defining clear business value for synthetic data so that the synthesized data becomes a general source of truth for its intended purpose.
Another is generating high-quality synthetic data that simulates a given phenomenon, as the data's quality relies heavily on knowledge of the phenomenon's underlying distributional properties. Gaining that knowledge, in turn, usually requires access to high-quality real data or to expert knowledge.
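To make that dependence concrete, the following hypothetical sketch fits log-normal parameters to a small stand-in sample of real transaction amounts; those estimates, rather than the raw records, are what the synthesis step actually needs, and they have to come either from real data like this or from expert judgment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a small sample of real transaction amounts (the scarce, high-quality data).
real_amounts = rng.lognormal(mean=4.0, sigma=1.2, size=500)

# Estimate the underlying distributional properties from the observed sample.
log_amounts = np.log(real_amounts)
mu_hat, sigma_hat = log_amounts.mean(), log_amounts.std()

# The estimated parameters, not the raw records, drive the synthesis step.
synthetic_amounts = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=10_000)
print(f"estimated mu={mu_hat:.2f}, sigma={sigma_hat:.2f}, sample={synthetic_amounts[:3]}")
```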
Q7: What do you see as the highest priority use cases that would benefit from synthetic data?
- Privacy
- Testing & validation of machine learning models
- Showcasing and testing a data product with customer-like synthetic data
Q8: Are the synthetic data use cases you have mentioned significant for early business phases or mature operations/processes within your organization?
This depends on the use case. Early adoption of synthetic data could be highly beneficial for companies that build data products with machine learning components.
Q9: How would your organization make use of synthetic data if it was available (if at all)?
High-quality and reliable external synthetic data could be used to validate our data-driven approach to AML. For example, we could use synthetic data to test the quality and predictive power of Lucinity's machine-learning models. That, of course, assumes that the synthetic data set is reliably labeled.
Q10: What synthetic data sets would you find most valuable to have access to?
For Lucinity's purposes, all of the above. Lucinity uses know-your-customer (KYC) and transactional data to find the money-laundering needle in the haystack. High-quality synthetic data could be used to test the system end-to-end and to further develop and validate our data-driven approach.
A challenge in the AML space is working correctly with labeled data, mainly because most money laundering cases have historically gone unidentified. As a result, labeled data sets, even synthetic ones, tend to be unbalanced and give an inaccurate picture of what is actually happening.
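The effect of that imbalance on evaluation can be shown with a toy example; the 0.1% laundering rate below is purely illustrative, not a real figure. A model that never raises an alert still scores near-perfect accuracy, which is why reliable labels and metrics focused on the positive class matter so much.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume roughly 0.1% of transactions are true money-laundering cases (illustrative only).
y_true = rng.random(100_000) < 0.001
# A "detector" that never raises an alert.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
recall = y_pred[y_true].mean() if y_true.any() else 0.0

print(f"accuracy: {accuracy:.4f}")  # roughly 0.999, despite detecting nothing
print(f"recall:   {recall:.4f}")    # 0.0: not a single laundering case found
```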
Q11: What requirements would you need for the synthetic data to feasibly meet your use cases?
- Reflect known transactional patterns
- Contain reliable associations between transactions and actors (and preferably all the KYC data)
- Have reliable labels for validation purposes
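As a hypothetical sketch only, not a real Lucinity schema, a synthetic record meeting these requirements might look like the following: each transaction links sender and receiver actors carrying KYC-style attributes and carries a ground-truth label for validation.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Actor:
    """A synthetic customer; a real KYC record would carry many more attributes."""
    actor_id: str
    country: str
    risk_segment: str

@dataclass
class Transaction:
    """A synthetic transaction with reliable actor links and a ground-truth label."""
    tx_id: str
    sender: Actor
    receiver: Actor
    amount: float
    currency: str
    timestamp: datetime
    is_laundering: bool  # reliable label for validation purposes

tx = Transaction(
    tx_id="tx-0001",
    sender=Actor("a-001", "IS", "retail"),
    receiver=Actor("a-002", "GB", "corporate"),
    amount=950.00,
    currency="GBP",
    timestamp=datetime(2023, 2, 1, 12, 30),
    is_laundering=False,
)
print(tx)
```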
Read the FCA's full feedback statement below.