The consumer price index from a big data mining perspective

17 November, 2021

Consumer prices and the consumer price index play an essential role in the management and administration of the State's macroeconomic policies, contributing to the development of production, business activities and international trade. In developed countries, big data has been applied to consumer price index calculation with outstanding results. Recognizing the necessity and relevance of this global trend, and aiming to contribute to Vietnam's experience in big data mining, authors from the UEH School of Technology and Design conducted the research “Big data mining in consumer price index calculation in Vietnam (Ho Chi Minh City case)”. This article introduces a method of collecting prices from websites for consumer price index calculation, outlines the challenges involved, and presents recommendations for improving the quality of current consumer price information collection.

Current traditional consumer price index approaches

Consumer prices and the consumer price index play a crucial role in the management and administration of the State's macroeconomic policies, including monetary and financial management, inflation management, bank interest rate adjustments, exchange rate adjustments, regional socio-economic development policies, salary policies and so on, all of which contribute to the development of production, business activities and international trade.

The consumer price index (CPI) is a relative indicator (expressed in %) that reflects the trend and level of consumer price fluctuations over time for the items in a representative basket of consumer goods and services. The consumer price is the amount of money a consumer pays to purchase one unit of a good or service used directly in daily life. It is expressed as the retail price of goods on the market or the price of services that serve people's daily life. Where goods or services have no listed price and the buyer can bargain, the consumer price is the price the buyer actually pays after negotiating with the seller.
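
To make the definition concrete, the following is a minimal sketch of a fixed-basket (Laspeyres-type) index, computed as a weighted average of price relatives. The article does not state the exact formula used in the official survey, so the choice of formula is an assumption, and the item names, prices and weights below are purely hypothetical.

```python
def laspeyres_index(base_prices, current_prices, weights):
    """Weighted average of price relatives, in % (base period = 100).

    `weights` are assumed to be base-period expenditure shares.
    """
    total_weight = sum(weights.values())
    weighted_relatives = sum(
        weights[item] * current_prices[item] / base_prices[item]
        for item in weights
    )
    return 100.0 * weighted_relatives / total_weight


# Hypothetical three-item basket (prices in VND); not real survey data.
base_prices = {"rice_kg": 20000, "pork_kg": 150000, "bus_ticket": 7000}
current_prices = {"rice_kg": 21000, "pork_kg": 165000, "bus_ticket": 7000}
weights = {"rice_kg": 0.5, "pork_kg": 0.3, "bus_ticket": 0.2}

cpi = laspeyres_index(base_prices, current_prices, weights)
print(f"CPI = {cpi:.2f}")
```

In this illustrative basket, rice rises 5%, pork rises 10% and the bus ticket is unchanged, so the index comes out at 105.5, meaning prices are 5.5% above the base period.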

In Vietnam, statistical information on the consumer price index is collected through the consumer price survey conducted by the General Statistics Office in all 63 provinces and cities, and is published monthly on the last day of the month. Under the consumer price survey plan for the 2014-2019 period, the Statistics Department surveyed, by sampling, a basket of 654 goods and services at market stalls, retail outlets, business and service establishments and other venues with stable business locations. Goods and services in the basket are divided into 3 main groups: the 1st group is surveyed once a month, on the 10th; the 2nd group is surveyed 3 times a month, on the 1st, 10th and 20th; the 3rd group is surveyed according to the number of occurrences in the month.

However, the traditional survey method also has some shortcomings. Firstly, on-site questionnaire collection faces many difficulties: when collection dates fall on public holidays, Tet or periods of local social distancing, most businesses are closed, and the prices of goods and services often fluctuate sharply at these times. Secondly, non-sampling errors still arise during the collection process. Thirdly, it is difficult to handle goods and services with short life cycles that no longer exist at survey time, as well as the many new goods that appear during the survey period. In addition, the traditional method faces further difficulties during data collection: survey funding, sample selection, determining the number of local venues selling stabilized goods, and dealing with stores in the survey sample whose business is shrinking or whose market share is declining.

Approaching the consumer price index through big data mining in the digital era

The foundation of Industrial Revolution 4.0, along with the development of the digital economy, has created new data sources, a valuable opportunity for the statistics sector to improve the quality and efficiency of information collection.

Online price data will help price statistics measure price changes more accurately, expand the sample size, identify substitutes accurately based on consumer behavior, reduce or eliminate dependence on respondents, and reduce survey costs. In addition, data collected from websites will shorten collection time, provide more detail in both quantity and diversity, and allow a higher collection frequency without increasing costs.

To study the process and method of calculating the consumer price index under a big data approach, the authors collected information from 28 websites of companies and e-commerce platforms, covering 246,069 items.

Online price data offer the statistics sector an opportunity to address the challenges that traditional consumer price statistics are facing. Price collection from big data covers far more items than the traditional method, approximately 250,000 items over a 3-month period; as a result, a CPI calculated from big data is both more sensitive to market price fluctuations and more stable. If an item in the traditional CPI basket (especially one collected only once a month) fluctuates strongly, rising or falling sharply, the overall CPI is greatly affected. A CPI based on big data, on the other hand, is not greatly affected, because many other goods in the same group are collected and show no large volatility.
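
The stability argument can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the index for a group of goods is an unweighted geometric mean of price relatives (a Jevons-type index). The article does not say which formula the authors used, and the sample sizes and the 50% price shock are hypothetical.

```python
import math


def jevons_index(price_relatives):
    """Unweighted geometric mean of price relatives, in % (previous period = 100)."""
    log_mean = sum(math.log(r) for r in price_relatives) / len(price_relatives)
    return 100.0 * math.exp(log_mean)


# One item jumps by 50%; every other item in the group keeps its price.
shock = 1.5
small_sample = [shock] + [1.0] * 4    # 5 items, as in a small survey basket
large_sample = [shock] + [1.0] * 499  # 500 items, as in web-collected data

small_index = jevons_index(small_sample)  # about 108.45
large_index = jevons_index(large_sample)  # about 100.08

print(f"5 items:   {small_index:.2f}")
print(f"500 items: {large_index:.2f}")
```

With 5 items the single outlier moves the group index by more than 8%, while with 500 items the same outlier moves it by less than 0.1%.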

In addition, collection from big data is more representative in terms of sampling: because the prices of all goods are collected, the sample can be considered representative of all goods available on the market. For the traditional CPI, in contrast, the basket is updated every 5 years; consequently, many popular consumer goods that have appeared on the market are not yet included. Moreover, with only 654 representative consumer products, the basket is incomplete relative to a constantly changing market.

Data source operation and exploitation in consumer price index calculation

Mining big data for statistics is a new field, and human resources in it are lacking in quantity and limited in quality. System operation, software development, data mining and related skills are needed to do this work well. Educational institutions therefore need to improve training quality, and training programs must stay close to practice and meet job requirements, so that trainees can handle practical big data problems once they are recruited. At the same time, the State should develop a coordination mechanism and create favorable conditions for a close relationship between employers and UEH.

Regarding training, the statistics sector needs to focus on improving capacity and skills, raising the level of information technology application, and equipping both specialized professional staff and information technology workers across the sector with data-related knowledge and skills. It should also research and apply data science and modern advanced technologies suitable for statistical work in Vietnam.

When big data is discussed, a frequently mentioned concept is machine learning (ML), which teaches computers to do what comes naturally to humans: learn from experience. Currently, the authors code items manually according to the list of representative goods and services (commodity code level 5) of the consumer price index. Economic classification and data coding are routine and essential statistical work that ensure all collected data are comparable. Therefore, if researchers can apply machine learning algorithms to data coding, the massive workload will be greatly reduced.

The authors suggest the following approach: first, select a small sample of the data set for experts to code; second, run algorithms so that the machine learns from the experts' coded examples; finally, use the trained model to classify or code the rest of the data. During implementation, to ensure the accuracy and reliability of the coded data, the procedure can be repeated many times: choose a small sample, train the machine, expand the sample for machine coding, check samples, deploy to the whole data set if the accuracy rate is satisfactory, or expand the training sample so that the machine learns more if the rate is unsatisfactory, and so on. Automatic coding of this kind will allow data to be compiled and published earlier, making the published data more valuable to users.
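
The iterative procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the word-vote classifier is a deliberately simple stand-in, not the authors' algorithm; the commodity codes ("RICE", "MILK", "NOODLES") are hypothetical labels, not real level-5 codes; and the 0.9 accuracy threshold is an arbitrary choice.

```python
from collections import Counter, defaultdict


def train(labelled):
    """Count how often each word appears under each commodity code."""
    word_counts = defaultdict(Counter)
    for description, code in labelled:
        for word in description.lower().split():
            word_counts[word][code] += 1
    return word_counts


def predict(word_counts, description):
    """Return the code with the most word votes, or None if no word is known."""
    votes = Counter()
    for word in description.lower().split():
        votes.update(word_counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def accuracy(word_counts, check_sample):
    """Share of expert-coded check items that the model codes identically."""
    hits = sum(predict(word_counts, d) == c for d, c in check_sample)
    return hits / len(check_sample)


# Hypothetical expert-coded items: a pool to learn from and a check sample.
expert_pool = [
    ("jasmine rice 5kg bag", "RICE"),
    ("fresh milk 1l carton", "MILK"),
    ("instant noodles pack", "NOODLES"),
    ("sticky rice 1kg", "RICE"),
    ("condensed milk can", "MILK"),
    ("egg noodles 500g", "NOODLES"),
]
check_sample = [
    ("brown rice 2kg", "RICE"),
    ("soy milk 1l", "MILK"),
    ("dried noodles 400g", "NOODLES"),
]
uncoded = ["organic rice 10kg", "uht milk 180ml"]

THRESHOLD = 0.9  # assumed "satisfactory" accuracy rate
train_size, step = 2, 2

# Start small; expand the training sample until the check rate is satisfactory
# or the expert-coded pool is exhausted.
while True:
    model = train(expert_pool[:train_size])
    rate = accuracy(model, check_sample)
    if rate >= THRESHOLD or train_size >= len(expert_pool):
        break
    train_size += step

# Deploy the accepted model to the remaining, uncoded items.
coded = {description: predict(model, description) for description in uncoded}
print(train_size, rate, coded)
```

In this toy run the first two training items are not enough, because the model has never seen a noodles product. Expanding the sample to four items brings the check rate up to the threshold, after which the remaining items are coded automatically. A real system would need a far larger expert-coded sample and a stronger classifier.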

What is more, big data exploitation and analysis require a well-developed information technology infrastructure and advanced technologies. To develop this field, investment in computing infrastructure should be prioritized: upgrading information technology infrastructure; leasing information technology infrastructure to serve the system that collects information through electronic questionnaires; enhancing the capacity of the Server Center; emphasizing the implementation of an online system that collects information from websites and processes it in real time; and building an overall architecture so that the system is open and ready to be integrated, upgraded and expanded when necessary.

In terms of funding, most modern information technology systems are expensive and also incur additional costs (for example, operation and maintenance costs, especially the cost of operating systems, software licenses, safety equipment, network security and so on). Therefore, when deploying information extraction from big data, investment effectiveness must be considered.

Besides, attention should be paid to supplementing other alternative data sources, such as data from providers, and to the development of the e-commerce market, factors that contribute to improving the quality of the consumer price index.

Along with the development of the digital era, the advent of big data is an opportunity to boost the economy if we know how to take full advantage of it.

Please refer to the full research titled “Big data mining in consumer price index calculation” here.

Author group: PhD. Hà Văn Sơn – UEH School of Technology and Design, MA. Nguyễn Thanh Bình – Statistical Department at Ho Chi Minh City.

This article is part of the series Spreading research and applied knowledge from UEH. We invite our distinguished readers to look forward to Knowledge News ECONOMY NUMBER #14 “Smart City with human-centered principle”.

News, Photos: Author group, Department of Marketing and Communication.